[02:30:19] shdubsh: no, why?
[03:42:45] marostegui: if we're reading dbctl correctly, it's depooled and db1111 has taken most of the load. c.danis adjusted the weights for s8 earlier today to mitigate MW timeouts
[03:43:10] marostegui: https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?panelId=21&fullscreen&orgId=1&from=1588045805874&to=1588131600267
[03:44:43] there is a bit more context in -sre
[05:27:16] shdubsh: thanks, I will check why db1114 is depooled
[05:57:27] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) @Bstorm we can do a few things I think: 1) Firmware upgrades 2) Re-create the RAID (RAID10, strip size 256KB) 3) Try again Buster + 10.4 and see if it crashes, maybe it is ju...
[06:03:09] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui)
[06:35:39] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1105.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20200429...
[06:42:16] 10DBA, 10Upstream: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 (10Marostegui) 10.4.13 is about to be released (it was scheduled for 27th) but looks like this issue won't make it to that release, despite what the developer initially said - it seems they MyIS...
[06:43:42] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) Looks like this is included in the 10.4.13 release finally: https://jira.mariadb.org/projects/MDEV/versions/24223 not closing until I can confirm this is indeed included.
[06:52:55] morning marostegui ! :)
[06:53:51] hey addshore
[06:53:59] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1105.eqiad.wmnet'] ` and were **ALL** successful.
[06:54:16] so db1114 wasn't meant to be depooled? :D
[06:54:23] yep
[06:54:35] phew, I thought things dramatically got worse this week!
[06:54:46] good news is, without 2 hosts, everything still runs, it's just slow :P
[06:54:49] hehe, no, it was depooled by mistake
[06:54:59] and only slow for ~ a 3 hour period in a 24 hour window
[06:55:13] maybe a 4-5 hour period
[06:55:33] yeah, with 2 hosts out it gets degraded
[06:55:39] I guess with 3 hosts out it would just die?
[06:56:08] I'm not sure, I think it might actually survive, just be even slower; I think db1111 would need reduced load in that case
[06:56:34] but the slower the dbs, the slower everything else, so it results in dramatically fewer edits, etc., which has knock-on reductions in reading for many things
[06:56:42] Once we are on 10.4, CPU will be much better there
[06:56:48] As we have seen with those 10.4-specific hosts
[06:56:52] :D I look forward to it :)
[06:57:04] marostegui: https://phabricator.wikimedia.org/T186188#5821983 is the dependency right? Does DC failover depend on master failovers, or is the relationship reversed?
[06:57:32] jynus: Not sure what you mean?
[06:58:04] or maybe "it is a subtask/step of failovers"?
[06:58:16] ah you are talking about parents/subtasks?
[06:58:44] yes
[06:59:04] jynus: I don't know, there's a lot of confusion with that sort of thing, feel free to change it
[06:59:10] nah
[06:59:15] I just want to understand
[06:59:40] is master failover blocked on dc failover?
[06:59:44] or what is the status?
[07:00:07] it is not blocked on that, it would be nice to do them once eqiad is passive, but it is not hard-blocked on that
[07:00:24] you mean the row D maintenance?
[07:00:40] sorry jynus - I will get back to you in a bit, I am busy with something else
[07:00:41] (the thing that is "nice to do"?)
[07:00:43] ok
[07:05:12] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10jcrespo) I am ready to do a logical backup of a labsdb host into dbprov1001 and load it into labsdb1011. This will take a long time, at least 3 days or more, but it would a) allow taking...
[07:14:12] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) +1 to take a logical from labsdb1012
[07:16:13] jynus: so the network maintenance on row D: there is no date for that and I am not even sure it will happen before summer. What we have to do is move some of the masters outside of row D (we have 4 there, and we should spread them). The idea was to do that with the DC switchover, but looks like it won't happen, so we could do that on our own, but as we are trying to reduce the risky operations, that task is blocked on us
[07:16:13] really
[07:20:17] so I would like to stall the ticket
[07:20:34] and make it blocked on the switchover
[07:20:50] it is not blocked on the switchover really
[07:21:00] then just stall the ticket :-D
[07:21:10] and remove the relation :-D
[07:21:17] up to you
[07:29:46] 10DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (10jcrespo) 05Open→03Stalled Stalling as per Marostegui's updates. Not really blocked on switchover anymore.
[07:30:22] 10DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (10jcrespo)
[07:30:48] 10DBA: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (10jcrespo) 05Open→03Stalled
[07:30:53] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10jcrespo)
[07:30:55] 10DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (10jcrespo)
[07:46:23] Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2020-04-28 20:42:07 is 1089 GB, but previous one was 1620 GB, a change of 32.8%
[07:46:30] ^addshore
[07:46:59] that's the wb_terms drop
[07:47:06] yeah, I imagined that
[07:47:12] wanted to share the difference
[07:47:20] ah ok ok
[07:48:47] 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10Marostegui) 05Open→03Resolved a:03Marostegui All the hosts are now...
[07:48:56] that is a 600GB difference after gzip compression
[07:49:33] enough to store a new enwiki, or a third of otrs!
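The 32.8% in the backup-size notification above is just the relative change between the two reported snapshot sizes. A minimal sketch of that arithmetic, using only the numbers quoted in the log:

```python
# Snapshot sizes for s8 at codfw as reported by the backup check above (in GB).
previous_gb = 1620
latest_gb = 1089

# Relative change between the previous and the latest snapshot.
change_pct = (previous_gb - latest_gb) / previous_gb * 100
print(f"change: {change_pct:.1f}%")  # -> change: 32.8%
```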
[07:53:15] I am going to restart db1111's prometheus mysqld exporter, seems dead
[07:53:22] ok
[07:54:09] now it works again
[07:59:14] 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701 (10Marostegui) I have updated the mysql package to 10.1.43-2 Tomorrow I will issue the restart + mysql_upgrade
[08:36:10] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10Dzahn)
[10:37:11] 10DBA, 10Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10Kormat)
[10:37:54] 10DBA, 10Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10Kormat)
[10:38:00] marostegui: ^
[10:38:05] thanks
[10:39:12] finding a permalink to a line in a file in gerrit sucks :P
[10:41:54] you can use: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/install_server/files/autoinstall/netboot.cfg#84
[10:42:12] that's not a _permalink_ though.
[10:42:25] if someone adds or deletes a few lines, the issue will then point to the wrong place
[10:42:39] you can point to a specific commit though
[10:42:41] (at least)
[10:43:05] i eventually figured it out - check the first link in the task
[10:43:18] ah yeah
[10:43:24] I didn't read the task yet :p
[10:43:29] haha
[10:45:55] kormat: for sharing code
[10:46:06] phabricator diffusion is much better
[10:46:13] or just github
[10:47:21] I find phab diffusion hard to use, I prefer github yeah
[10:49:51] i can't even figure out how to use phab diffusion
[10:50:00] XDDD
[10:50:24] link to a single line on HEAD: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/mariadb/manifests/init.pp$3
[10:51:17] i think i need to bookmark https://phabricator.wikimedia.org/source/operations-puppet/, that took ages to find
[10:51:35] that's a good tip
[10:51:52] git grepping is sometimes the way to go
[10:52:00] but other times the browser works better
[10:52:54] kormat: depending on how much you care about your privacy it's also mirrored to github ;)
[10:53:07] yeah, we commented on that before :-D
[10:53:19] [12:46] or just github
[10:53:54] phab diffusion appears to be exactly as useless at this as gerrit/gitiles. github it is ;)
[10:53:59] ha ha
[10:54:19] I don't know, I like to use it when commenting on tickets
[10:54:40] "this is bad because X, Y, Z"
[10:54:57] but link to gerrit if there is an ongoing patch
[10:55:24] on the other hand, conversations are much better on phabricator than gerrit
[12:33:50] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2087.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202004291233...
[12:54:13] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2087.codfw.wmnet'] ` and were **ALL** successful.
[13:18:54] 10DBA, 10Cloud-Services: Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (10Zoranzoki21)
[13:19:35] things that give me nightmares: the number of times i have had to type `db2087` in this reimaging process.
[13:19:49] make a typo in any one of those places, and at the very least you're going to get a surprise
[13:20:02] 10DBA, 10Cloud-Services: Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (10Marostegui) p:05Triage→03Medium Let us know when the database is created so we can sanitize it.
[13:22:19] shouldn't it be just the one?
[13:23:30] I guess you last connect to it manually later
[13:23:42] and before
[13:24:10] and the depool/repool, ofc
[13:24:32] yeah, that should be just 1 script
[13:25:25] well, let's see: !log message, 2 puppet changes (across 3 files), dbctl to depool, connect to host (to stop mariadb etc), connect to mgmt interface, wmf-auto-reimage, remove+readd to tendril, re-ssh to host, repool
[13:25:36] all of those steps involve the hostname
[13:26:13] yeah, that is 1 script + a fix of a partman recipe
[13:26:30] marostegui has threatened to write one in bash ;)
[13:26:42] the puppet change I think is on purpose
[13:26:52] for good reasons
[13:27:13] people at app servers have it fully automated
[13:27:23] we can do it too, it just needs more work
[13:28:16] you should discuss it with manuel, but if I were tasked to mass-reimage
[13:28:26] I would start by creating a proper partman recipe
[13:36:15] but it's not quite that simple, we would need error detection, pooling state errors, etc.
[13:36:37] wmf-auto-reimage took at least 2 years to be usable for me
[13:37:18] due to all the hw race conditions and errors and variants
[13:37:22] hw is hard :-D
[13:43:40] the other issue is that 1 out of 2 times, the servers don't reboot, because they have some hw or bios issue, or something
[13:44:03] lovely
[13:47:21] icinga is saying that SMART is failing for db2087. running `smart-data-dump` fails with a python exception on the host
[13:47:51] https://phabricator.wikimedia.org/P11084
[13:48:00] could that be what's making icinga unhappy, and any idea how to fix it?
[13:48:16] it was updated in the last few days, cc shdubsh ^^^
[13:48:39] and godog too given it's too early for PST
[13:56:14] kormat, volans: thanks for the report. Will have a look today.
[14:11:27] i'm going to downtime the SMART check on that host for 24h, so i can continue with the rest of it.
[14:12:31] shdubsh: i'll create a task for the problem, so it can be tracked.
[14:15:51] https://phabricator.wikimedia.org/T251413 created - i have zero idea what tags to put on it
[14:17:06] kormat: when in doubt for infra stuff, operations is always a good start
[14:17:36] and might be the only one needed in this case as it's not specific to observability
[14:28:23] aand now icinga says smart is happy on that host.
whyy
[14:31:28] if I have to bet I'd say the second puppet run fixed it
[14:32:07] if that's the case (we can check puppetboard in a sec) that's a smell of an issue with our puppetization not setting everything up correctly in the first pass
[14:32:34] often due to resource ordering, but that's just a bet, don't quote me on this :)
[14:33:11] 10DBA, 10Operations: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10jcrespo)
[14:35:12] I don't see any immediate evidence of that so far though
[14:36:03] 10DBA, 10DC-Ops, 10Operations: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10jcrespo)
[14:40:51] kormat: you know what's weird, re-looking at your paste, I noticed that the facter version reported a date there
[14:41:12] volans: yeah i figured out the issue there, and updated the task i filed
[14:41:20] T251413
[14:41:21] T251413: smart-data-dump fails with ValueError when trying to parse a date - https://phabricator.wikimedia.org/T251413
[14:41:52] ah, it doesn't for me :)
[14:41:55] nice catch!
[14:42:48] that should be an easy fix, right?
[14:42:53] so the smart data dump runs at minute 50
[14:43:07] jynus: yes. smart-data-dump should enforce a consistent locale
[14:43:09] are you thinking just a depends?
[14:43:16] for riccardo
[14:43:22] rather than a sw fix?
[14:43:24] (and i should fix my env)
[14:43:29] if that was in the middle of the first puppet run it might explain it
[14:43:56] not sure, it depends how it happened; if it fixed itself it smells like some sort of race during the installation
[14:43:59] not sure
[14:44:27] i'd love to see a dashboard for SMART metrics, but i'm not finding one
[15:06:23] kormat: I don't think one exists (yet!)
[15:06:44] ah, damn :)
[16:04:26] marostegui: btw, I don't know if you saw, I messed with s8 weights again yesterday, there have been overload issues there the past two days
[16:05:47] cdanis: it is because I depooled db1114 by mistake when depooling db1104
[16:05:51] I pooled it back today
[16:05:54] ok that is what we thought but we weren't sure D:
[16:05:56] :D
[16:06:02] I have been running 100 mph lately, I need to slow down
[16:06:52] Can't wait to have the new hosts ready, to be honest
[16:08:46] heh https://jira.mariadb.org/browse/MDEV-21794 has been raised to blocker, but as far as I know 10.4.13 is "closed", we'll see what happens
[18:23:31] for a new Buster build should I use wmf-mariadb104 or wmf-mariadb103? (Or should I skip the wmf builds entirely and use some debian upstream package?)
[18:23:45] This is a pretty trivial use-case, it's just the local zone cache for powerdns
[19:04:21] andrewbogott: we are using 104 in production (or starting to)
[19:04:23] so I would suggest 104
[19:05:37] Ok, will try that. Thanks!
[19:54:03] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Bstorm) Ok, in that case, I'll go ahead and run with the firmware upgrade because I think that's a good idea here no matter what. I will try to squeeze that in today (the morning was swa...
[22:20:25] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (10Bstorm) `lang=shell-session root@labsdb1011:~# hpssacli controller all show config detail | grep Firm Firmware Version: 7.00 ` It's up-to-date now. All yours!
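The fix suggested at 14:43:07 for the smart-data-dump failure (T251413) is to enforce a consistent locale, since the ValueError came from parsing a date whose format depended on the caller's environment. Below is a minimal sketch of that idea, assuming a Python collector that shells out to facter; the helper name and the facter invocation are illustrative only, not the actual smart-data-dump code.

```python
# Sketch only: pin the locale when shelling out to external tools such as facter,
# so their output (including any dates) is not affected by the caller's environment.
import os
import subprocess


def run_with_c_locale(cmd):
    """Run `cmd` with LC_ALL/LANG forced to C so output formatting is predictable."""
    env = dict(os.environ, LC_ALL="C", LANG="C", LANGUAGE="C")
    return subprocess.run(cmd, capture_output=True, text=True, check=True, env=env).stdout


if __name__ == "__main__":
    # Hypothetical usage: query facter the way a collector script might.
    print(run_with_c_locale(["facter", "--version"]).strip())
```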