[04:15:28] DBA, Operations: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (Marostegui) Thank you, I can see it now! ` => controller all show status Smart Array P840 in Slot 1 Controller Status: OK Cache Status: Not Configured Battery/Capacitor Status: OK => ` I have start...
[04:19:24] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[04:30:25] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[04:56:40] marostegui: good morning i hate you
[04:56:55] morning! isn't it nice to have the whole morning available?
[04:57:34] marostegui: this isn't even a real time :(
[04:57:50] I know, it is late :(
[04:57:58] ✊
[05:23:49] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[05:25:35] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[05:26:34] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[05:30:10] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[05:36:34] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[06:16:19] DBA: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui) I have written some documentation about failing over es hosts, as it is slightly different from the normal sX failovers: https://wikitech.wikimedia.org/wiki/MariaDB#External_store_section_failover_checklist
[06:16:29] ^ feel free to add/delete/modify thins
[06:16:30] things
[06:19:29] TIL of pt-config-diff
[06:32:46] DBA: Switchover es5 master from es1023
to es1024 - https://phabricator.wikimedia.org/T255755 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1023.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202007070632_marostegui_1791.l...
[06:41:00] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (jcrespo) >>! In T254795#6283878, @dpifke wrote: > Given the current total size of ~18 GB and write rate of ~1 GB/mo, I'm thinking this shouldn't add a huge cost. I agree, which is why I asked- it takes little...
[06:46:14] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (jcrespo) > Working on it. Backup of the new db was set up, please allow 1 week to confirm they are running normally before closing the ticket.
[06:53:57] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (QChris) >>! In T255715#6275884, @QChris wrote: > [...] then I think we won't need the actual databa...
[06:57:16] ^if manuel or stephen can take care of that ticket's main db stuff, I can take care of removing backup specific things
[06:58:04] jynus: yep, I can do that
[06:58:25] kormat: I am reimaging es1023 and I am prompted with the manual disk partitioning, I wasn't expecting that
[06:58:26] i can confirm that marostegui can do that
[06:58:34] kormat: huuh
[06:58:46] er
[06:58:51] marostegui: can i take a look?
[06:58:55] definitely
[06:59:02] you are most kind
[07:00:27] the gerrit BTW wasn't a ping for urgency
[07:00:45] just for coordination, it is not a priority to remove unused stuff
[07:02:08] although let's take a last backup when you are about to actually delete stuff
[07:02:24] marostegui: ohh.
that host is not matched by the line in netboot.cfg
[07:02:48] it has `es101[1-9]`
[07:03:04] it was non-destructive anyway, right?
[07:03:06] marostegui: if it's not urgent, let me audit that line to see if anything else is missing
[07:03:35] fails back to just being "stuck"?
[07:03:42] jynus: yep, waiting for human input
[07:03:45] cool
[07:03:52] then things work as intended :-D
[07:08:21] ah right! yeah, those are the new ones
[07:08:27] and hence not in that line indeed
[07:09:48] marostegui: the complexity/specificity of this line still concerns me a bit.
[07:10:22] yeah, we could make it as generic as db hosts
[07:10:44] is there something we predict that could break because of it?
[07:10:54] `db[12][0-9][0-9][0-9]` etc
[07:11:18] I see having sooner special db* hosts rather than es/pc ones
[07:11:19] or even `db[12]*`
[07:11:29] kormat: careful
[07:11:41] those are not regular expressions even if they look like them
[07:11:45] they are bash selectors
[07:11:52] they are not pcre
[07:11:58] don't worry, i know
[07:12:01] ok
[07:12:23] Can I get a check on https://gerrit.wikimedia.org/r/c/operations/puppet/+/609914/1/modules/install_server/files/autoinstall/netboot.cfg ?
[07:13:16] we don't have any es100X host anymore?
[07:13:34] jynus: apparently not
[07:13:38] (at least according to cumin)
[07:13:45] so that should fix it
[07:14:22] ...until we buy db103X hosts
[07:16:13] no, we don't have any es100X anymore
[07:16:31] bacula queue seems a bit slow, backups are taking quite some time today
[07:16:52] we may get some stale backup alerts today
[07:20:44] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1023.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20200707...
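Note on the 07:11 exchange: the warning that the netboot.cfg patterns are shell-style selectors rather than regular expressions can be illustrated with a small sketch. It assumes Python's `fnmatch` module as a stand-in for shell pattern matching; the helper names `glob_matches` and `regex_matches` are hypothetical, and this is not the installer's actual code.

```python
import fnmatch
import re


def glob_matches(pattern: str, host: str) -> bool:
    """Shell-style match: '*' means any run of characters, '[...]' is a character class."""
    return fnmatch.fnmatchcase(host, pattern)


def regex_matches(pattern: str, host: str) -> bool:
    """Regex match of the whole hostname: '*' means 'repeat the previous item'."""
    return re.fullmatch(pattern, host) is not None


# The pattern quoted in the log covers es1011..es1019 only, so es1023 falls through.
print(glob_matches("es101[1-9]", "es1019"))  # True
print(glob_matches("es101[1-9]", "es1023"))  # False

# The same text means different things as a glob and as a regex.
print(glob_matches("db[12]*", "db1079"))   # True: 'db', then 1 or 2, then anything
print(regex_matches("db[12]*", "db1079"))  # False: 'db' followed only by 1s and 2s
```

Read as a glob, `db[12]*` would catch every host whose name starts with `db1` or `db2`, which is why the broader pattern was floated as a way to avoid per-range entries.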
[07:28:16] DBA, Epic, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (Kormat)
[07:28:25] DBA, Epic, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (Kormat) p:Triage→Medium
[07:40:15] DBA, Operations: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (Marostegui) Open→Resolved db1079 fully repooled, db1136 also got its original weight restored. All done! Thank you, John, for replacing the BBU so fast!
[07:46:31] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1023.eqiad.wmnet'] ` and were **ALL** successful.
[07:48:50] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[08:12:28] DBA, Epic, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin2001.codfw.wmnet for hosts: ` ['es2021.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2020070...
[08:21:09] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (Marostegui) I would suggest we start by renaming the tables first, to make sure nothing really break...
[08:40:53] DBA, Epic, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2021.codfw.wmnet'] ` and were **ALL** successful.
[08:44:58] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui) Current sizes as of 7th July: wikidata ` -rw-r--r-- 1 dump dump 644M Jul 7 06:21 dump.s8.2020-07-07--05-55-2...
[09:14:02] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Peachey88)
[09:14:24] woot?
[09:14:34] Why didn't my herald rule work
[09:15:11] From what I can see it also didn't add andre
[09:15:47] that's the s6 master?
[09:15:53] correct
[09:16:22] marostegui: https://phabricator.wikimedia.org/H9 specifically excludes tasks in ops-eqiad so that's why Andre wasn't added
[09:17:06] Ah I see
[09:17:12] I need to check why I didn't get added
[09:17:50] what rule should have done that?
[09:18:29] https://phabricator.wikimedia.org/H281
[09:19:09] mmm, I am not sure that ticket is accurate
[09:19:14] root@db1131:~# megacli -pdlist -a0 | grep -i online
[09:19:14] Firmware state: Online, Spun Up
[09:19:14] Firmware state: Online, Spun Up
[09:19:14] Firmware state: Online, Spun Up
[09:19:14] Firmware state: Online, Spun Up
[09:19:15] Firmware state: Online, Spun Up
[09:19:25] https://phabricator.wikimedia.org/herald/transcript/3773641/
[09:19:32] "When all of these conditions are met:"
[09:19:33] oh, the whole disk is gone
[09:19:34] interesting
[09:19:34] that's your problem
[09:19:43] that rule needs that all 3 regexes match
[09:20:05] it has been working fine for years, but I edited it a few weeks ago, so maybe it broke it
[09:20:07] I will check later
[09:20:46] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) @wiki_willy this host is under warranty, can we get a new disk for it? ` [35898752.940170] megaraid_sas 0000:18:00.0: 726 (647382021s/0x0001/CRIT) - VD 00/0 is now DEGRADED [35898999.592143] m...
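Note on the 09:19 exchange: the point that a rule with "all of these conditions" only fires when every regex matches can be sketched as below. The three patterns are hypothetical placeholders, since the real conditions of H281 are not shown in the log, and `fires_when_all` / `fires_when_any` are illustrative helper names rather than anything from Herald.

```python
import re
from typing import Sequence


def fires_when_all(title: str, patterns: Sequence[str]) -> bool:
    """'All conditions' semantics: one non-matching regex blocks the rule."""
    return all(re.search(p, title, re.IGNORECASE) for p in patterns)


def fires_when_any(title: str, patterns: Sequence[str]) -> bool:
    """'Any condition' semantics: a single matching regex is enough."""
    return any(re.search(p, title, re.IGNORECASE) for p in patterns)


# Hypothetical patterns; the actual rule's regexes are unknown.
patterns = [r"degraded raid", r"\bdb\d{4}\b", r"eqiad"]
title = "Degraded RAID on db1131"

print(fires_when_all(title, patterns))  # False: the title never mentions "eqiad"
print(fires_when_any(title, patterns))  # True
```

Under these assumed patterns, a recently edited rule can stop firing because just one of its conditions no longer matches the task title.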
[09:22:07] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) p:Triage→High This is s6 primary database master
[09:23:34] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) Controller's log in case it is needed to get the RMA: ` seqNum: 0x000002d1 Time: Mon Jul 6 20:20:21 2020 Code: 0x0000010c Class: 1 Locale: 0x02 Event Description: PD 00(e0x20/s0) Path 500056...
[09:27:02] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[09:27:14] DBA, Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (Marostegui)
[09:27:16] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui) Open→Resolved Everything was done. Thanks everyone for helping out!
[10:26:30] DBA, Operations, Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (Marostegui) Failover procedure: OLD MASTER: db1097 NEW MASTER: db1080 [x] Check configuration differences between new and old master `$ pt-config-diff h=db1097.eqiad.wmn...
[11:59:48] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[12:08:40] jynus: https://phabricator.wikimedia.org/T253217 when do you plan to use db1084? Can I borrow it to be able to upgrade misc clusters?
The result will be the same, you'll still have one host for backup test-s1, but having db1097 broken has killed my plans to be able to keep upgrading misc hosts, so I would need to borrow this host
[12:09:14] It is just required for the upgrade itself, but I will be using the old misc master to upgrade other clusters, and as soon as m5 is finished, the resulting host will be yours :)
[12:09:36] reminds me of how google got started :)
[12:10:05] with a broken host? :)
[12:10:25] (sergey and larry needed machines to run their search engine project on, so they arranged that new machines for the dept would get delivered to them first, they'd then use them for a month, and deliver them fully configured to the department)
[12:10:43] haha
[12:21:25] Larry Page has joined the WMF?
[12:21:57] oh sorry, I have misread
[12:29:57] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[12:36:40] ok, you can take it
[12:36:47] \o/
[12:38:15] hashar: :)
[12:38:56] if only i knew what a company was back in the time, I would definitely have joined I guess ;)
[12:39:18] as chief coffee officer probably or some janitor position (cause a clean office IS important)
[13:48:06] DBA, Epic, Patch-For-Review, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (Kormat) Current status: - db2021 has been reimaged - db2022 had already been reimaged - db2020 has been replaced by db2021 as codfw master for es4 - db2020 has replic...
[15:04:12] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (wiki_willy) a:Jclark-ctr @Jclark-ctr - can you send in the RMA for this one, when you get in later today?
Thanks, Willy
[18:29:21] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (dpifke)
[19:41:20] DBA, Operations, ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Jclark-ctr) Confirm Confirmed: Service Request 1029100504 was successfully submitted.
[20:26:21] DBA, Schema-change, User-DannyS712: iwlinks indexes should be UNIQUE INDEXes - https://phabricator.wikimedia.org/T256842 (eprodromou)
[20:27:06] DBA, Schema-change, User-DannyS712: slot_revision_origin_role should be a UNIQUE INDEX - https://phabricator.wikimedia.org/T256841 (eprodromou)