[07:03:41] DBA, Patch-For-Review, User-notice: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556 (Marostegui) Adding the #user-notice tag as Etherpad will be on read-only for a few seconds Thursday 25th at 08:00 AM UTC Will email wikitech-l shortly.
[07:32:07] DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2133.codfw.wmnet'] ` The log can be found in `/var/log/wmf...
[07:49:22] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Data consistency check passed.
[07:50:10] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) @Kormat I am going to apply the MCR schema change and once I am done, maybe we can reimage this to Buster and 10.4 while DCOPs look for a BBU?
[07:58:05] DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2133.codfw.wmnet'] ` and were **ALL** successful.
[08:06:10] marostegui: re: disabling notifications while reimaging, i think that's less important than previously
[08:06:29] with the old system it required 2 stages of manual intervention to get the machine back running
[08:06:33] it's now down to 1
[08:06:51] But if you aren't fast enough in bringing mysql back up it might alert, no?
[08:06:54] so if you're confident that you'll be able to do the post-reimaging steps (starting mariadb, mysql_upgrade, etc etc) within the 2h
[08:07:12] that's true, indeed.
[08:07:36] as it is a codfw host, it shouldn't page really, but....
[08:11:16] one thing i look forward to with prom AM: you can put in silences for hosts that don't "exist" yet.
[08:11:42] I am not sure we will be able to do fully autonomous things
[08:11:51] normally when you reimage there is a reason, like an upgrade
[08:12:06] so you still need to mark it as such
[08:12:15] the difference is that we could do it in bulk
[08:27:56] DBA, Data-Services, User-Ladsgroup, cloud-services-team (Kanban): Prepare and check storage layer for shnwiktionary - https://phabricator.wikimedia.org/T256010 (Kormat) Data check was clean, and the view database was created. It's now ready for #cloud-services-team.
[08:28:25] DBA, Data-Services, User-Ladsgroup, cloud-services-team (Kanban): Prepare and check storage layer for shnwiktionary - https://phabricator.wikimedia.org/T256010 (Kormat) p:Triage→Medium
[08:40:10] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui) @jcrespo let's do this Tuesday 30th at 05:00 AM UTC?
[08:41:39] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (jcrespo) ok.
[08:51:18] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[08:51:47] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[09:17:29] jynus: FYI the cron spam from puppet-agent-stats on reimage should be fixed now
[09:18:45] what was changed?
[09:19:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/606977
[09:19:38] I see, so it was only a race condition with dependencies?
[09:20:11] I wouldn't have caught it, nice
[09:20:41] that'd be my guess yes, can't think of anything else that would cause only a couple of invocations to fail
[09:21:05] if at a later time that gets changed into a systemd timer that would have helped debug too
[09:21:34] as it would be synced compared to puppet execution
[09:21:57] thanks for the time, godog
[09:23:27] np! let me know how the next reimage goes
[09:23:45] I am actually about to do one
[09:23:47] will report
[09:32:47] https://doc.wikimedia.org/#infrastructure -> search starting by t
[09:33:29] oh nice!
[09:48:14] DBA: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['db2101.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006230948_jynus_15257.log`.
[10:07:12] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) @Kormat MCR schema change applied, you can proceed with the reimage anytime Thank you!
[10:07:46] DBA: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2101.codfw.wmnet'] ` and were **ALL** successful.
[10:23:26] marostegui: i'm in the habit of manually stopping mariadb and unmounting /srv before reimaging, just for peace of mind
[10:23:59] kormat: <3
[10:24:07] we found that it is quite safe
[10:24:20] you cannot know if someone else is logged in on that server
[10:24:27] or you forgot to stop another mysql instance
[10:24:33] so the umount will fail
[10:24:51] marostegui: i've previously ran into issues with systemd and a heavily loaded prometheus server where systemd would time out after 5 mins and kill -9 the prom daemon which was busy trying to flush everything to disk, leading to corruption everytime we rebooted
[10:25:22] indeed
[10:25:23] jynus: yeah that's useful
[10:25:34] our package has a timeout of infinity
[10:25:40] kormat: yeah, that used to be a problem in mysql (especially with init.d and big buffer pools) where stopping the daemon would even said it is all fine but the buffer pool would keep flushing
[10:25:52] but independently of that, systemd will shutdown despite of that
[10:26:11] kormat: so even with a failed BBU, umounting /srv would give us even more peace of mind
[10:26:24] DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1088.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-aut...
[10:28:00] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Reimaging in progress.
[10:31:35] one trick I do
[10:31:55] after mysql_upgrade, but before the restart on a major version upgrade
[10:32:18] set global innodb_buffer_pool_dump_at_shutdown=0;
[10:32:39] that way it loads the buffer quickly- otherwise it gets lost on the restart
[10:38:56] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (ayounsi)
[10:48:07] DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1088.eqiad.wmnet'] ` and were **ALL** successful.
[10:56:21] DBA, Patch-For-Review: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (jcrespo)
[11:01:14] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Reimage done, and host has caught up with replication.
[11:19:12] DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (jcrespo) a:Marostegui→jcrespo About to move db1102 sections to db1145.
[11:36:12] DBA, Operations, CAS-SSO, User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond)
[11:45:32] DBA, Operations, CAS-SSO, User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) Thanks for the detailed ticket. A few comments. 1) Let's not use `-`, if you really want that, we can go for `can_test`. 2) Can probably place this into m...
[11:47:33] DBA, Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (Marostegui)
[11:48:00] DBA, Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (Marostegui)
[11:48:15] DBA, Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (Marostegui) Open→Resolved All done
[11:48:19] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[11:48:43] DBA: inverse_timestamp column exists in text table, it shouldn't - https://phabricator.wikimedia.org/T250063 (Marostegui) Open→Resolved This is done as it was included at T250066
[11:48:48] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[11:50:52] Amir1: can you run your schema differences script?
[11:51:02] I have finished the first batch of tasks we created a few months ago
[11:51:18] Sure
[11:51:45] One thing is that with the MCR changes being merged but not deployed the list of drifts is going to the roof
[11:51:58] ah true :(
[11:52:11] can you exclude tables? like revision and archive?
[11:52:27] MCR is being deployed in s6 as we speak
[11:54:15] yeah sure
[12:06:31] kormat: for db1088 let's wait for DCOPs to come back with whether they have a BBU or not, if they don't and we have to order, let's pool it with some weight, if they do, you'd need to coordinate with them for the replacement
[12:06:47] typically just stop mysql and leave the host off the day you guys agree on
[12:07:11] I normally do that a few hours before the agreed hour and comment on the task saying that the host is ready for them to act on
[12:07:11] gotcha
[12:07:25] they normally leave it back ON but with mysql stopped of course
[12:14:13] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[12:14:30] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[12:41:00] DBA, Patch-For-Review: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1145.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006231240_jynus_20243.log`.
[12:52:55] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) MariaDB 10.1 doesn't allow the increase of varchar/varbinary with INPLACE alter tables, thi...
[13:03:40] DBA, Patch-For-Review: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1145.eqiad.wmnet'] ` and were **ALL** successful.
[13:10:22] DBA, Operations, CAS-SSO, User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond) >>! In T256120#6248144, @Marostegui wrote: > Thanks for the detailed ticket. > A few comments. > > 1) Let's not use `-`, if you really want that, we can go for...
[13:13:03] DBA, Operations, CAS-SSO, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Kormat)
[13:21:11] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[13:36:01] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Jclark-ctr) @marostegui we have spare bbu. i happen to be on site today can. Are you available to assist?
[13:39:19] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) @Jclark-ctr - i'm available. I can power the host off now for you to do the replacement. Just let me know when it's back. Cheers.
[13:40:12] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Host is now off.
[13:46:05] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Jclark-ctr) @Kormat bbu replaced powering up right now
[13:46:35] damn that was fast :)
[13:55:27] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) @Jclark-ctr : great, thanks :) The diagnostics are happy with the new bbu: ` root@db1088:~# hpssacli ctrl all show status Smart Array P840 in Slot 1 Controller Status: OK Cache Status: Not Configur...
[13:56:05] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) a:Jclark-ctr→Kormat
[14:00:19] DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Jclark-ctr) Open→Resolved
[14:01:34] DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat)
[14:02:05] DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Resolved→Open Re-opening for us to keep track of re-adding the host back into service.
[14:21:06] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) s6 eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1005...
[15:19:31] DBA, Operations, ops-codfw: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (Papaul) ` Create Dispatch: Success You have successfully submitted request SR1028000240.
[17:29:00] DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (wkandek) Should a BBU failure cause a reboot?
[22:09:43] DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (wkandek) HP says, the server should not reboot due to battery failure: https://support.hpe.com/hpesc/public/docDisplay?docId=mmr_kc-0126260#:~:text=POST%20Error%3A%20313%20%2D%20HPE%20Smart,other%20reasons%20for%20a%20reboot. "...