[07:03:41] <wikibugs>	 DBA, Patch-For-Review, User-notice: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556 (Marostegui) Adding the #user-notice tag as Etherpad will be on read-only for a few seconds Thursday 25th at 08:00 AM UTC Will email wikitech-l shortly.
[07:32:07] <wikibugs>	 DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2133.codfw.wmnet'] ` The log can be found in `/var/log/wmf...
[07:49:22] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Data consistency check passed.
[07:50:10] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) @Kormat I am going to apply the MCR schema change and once I am done, maybe we can reimage this to Buster and 10.4 while DCOPs look for a BBU?
[07:58:05] <wikibugs>	 DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2133.codfw.wmnet'] `  and were **ALL** successful.
[08:06:10] <kormat>	 marostegui: re: disabling notifications while reimaging, i think that's less important than previously
[08:06:29] <kormat>	 with the old system it required 2 stages of manual intervention to get the machine back running
[08:06:33] <kormat>	 it's now down to 1
[08:06:51] <marostegui>	 But if you aren't fast enough in bringing mysql back up it might alert, no?
[08:06:54] <kormat>	 so if you're confident that you'll be able to do the post-reimaging steps (starting mariadb, mysql_upgrade, etc etc) within the 2h
[08:07:12] <kormat>	 that's true, indeed.
[08:07:36] <marostegui>	 as it is a codfw host, it shouldn't page really, but....
[08:11:16] <kormat>	 one thing i look forward to with prom AM: you can put in silences for hosts that don't "exist" yet.
[08:11:42] <jynus>	 I am not sure we will be able to do fully autonomous things
[08:11:51] <jynus>	 normally when you reimage there is a reason, like an upgrade
[08:12:06] <jynus>	 so you still need to mark it as such
[08:12:15] <jynus>	 the difference is that we could do it in bulk
[08:27:56] <wikibugs>	 DBA, Data-Services, User-Ladsgroup, cloud-services-team (Kanban): Prepare and check storage layer for shnwiktionary - https://phabricator.wikimedia.org/T256010 (Kormat) Data check was clean, and the view database was created. It's now ready for #cloud-services-team.
[08:28:25] <wikibugs>	 DBA, Data-Services, User-Ladsgroup, cloud-services-team (Kanban): Prepare and check storage layer for shnwiktionary - https://phabricator.wikimedia.org/T256010 (Kormat) p:Triage→Medium
[08:40:10] <wikibugs>	 DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui) @jcrespo let's do this Tuesday 30th at 05:00 AM UTC?
[08:41:39] <wikibugs>	 DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (jcrespo) ok.
[08:51:18] <wikibugs>	 DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[08:51:47] <wikibugs>	 DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[09:17:29] <godog>	 jynus: FYI the cron spam from puppet-agent-stats on reimage should be fixed now
[09:18:45] <jynus>	 what was changed?
[09:19:04] <godog>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/606977
[09:19:38] <jynus>	 I see, so it was only a race condition with dependencies?
[09:20:11] <jynus>	 I wouldn't have caught it, nice
[09:20:41] <godog>	 that'd be my guess yes, can't think of anything else that would cause only a couple of invocations to fail
[09:21:05] <jynus>	 if at a later time that gets changed into a systemd timer that would have helped debug too
[09:21:34] <jynus>	 as it would be synced compared to puppet execution
[09:21:57] <jynus>	 thanks for the time, godog
[09:23:27] <godog>	 np! let me know how the next reimage goes
[09:23:45] <jynus>	 I am actually about to do one
[09:23:47] <jynus>	 will report
[09:32:47] <jynus>	 https://doc.wikimedia.org/#infrastructure -> search starting by t
[09:33:29] <marostegui>	 oh nice!
[09:48:14] <wikibugs>	 DBA: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['db2101.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006230948_jynus_15257.log`.
[10:07:12] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) @Kormat MCR schema change applied, you can proceed with the reimage anytime  Thank you!
[10:07:46] <wikibugs>	 DBA: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2101.codfw.wmnet'] `  and were **ALL** successful.
[10:23:26] <kormat>	 marostegui: i'm in the habit of manually stopping mariadb and unmounting /srv before reimaging, just for peace of mind
[10:23:59] <marostegui>	 kormat: <3
[10:24:07] <jynus>	 we found that it is quite safe
[10:24:20] <jynus>	 you cannot know if someone else is logged in on that server
[10:24:27] <jynus>	 or you forgot to stop another mysql instance
[10:24:33] <jynus>	 so the umount will fail
[10:24:51] <kormat>	 marostegui: i've previously run into issues with systemd and a heavily loaded prometheus server where systemd would time out after 5 mins and kill -9 the prom daemon which was busy trying to flush everything to disk, leading to corruption every time we rebooted
[10:25:22] <jynus>	 indeed
[10:25:23] <kormat>	 jynus: yeah that's useful
[10:25:34] <jynus>	 our package has a timeout of infinity
[10:25:40] <marostegui>	 kormat: yeah, that used to be a problem in mysql (especially with init.d and big buffer pools) where stopping the daemon would even say it is all fine but the buffer pool would keep flushing
[10:25:52] <jynus>	 but independently of that, systemd will shut down despite that
[10:26:11] <marostegui>	 kormat: so even with a failed BBU, umounting /srv would give us even more peace of mind
[10:26:24] <wikibugs>	 DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1088.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-aut...
[10:28:00] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Reimaging in progress.
[10:31:35] <jynus>	 one trick I do
[10:31:55] <jynus>	 after mysql_upgrade, but before the restart on a major version upgrade
[10:32:18] <jynus>	 set global innodb_buffer_pool_dump_at_shutdown=0;
[10:32:39] <jynus>	 that way it loads the buffer quickly- otherwise it gets lost on the restart
[10:38:56] <wikibugs>	 DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (ayounsi)
[10:48:07] <wikibugs>	 DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1088.eqiad.wmnet'] `  and were **ALL** successful.
[10:56:21] <wikibugs>	 DBA, Patch-For-Review: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (jcrespo)
[11:01:14] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Reimage done, and host has caught up with replication.
[11:19:12] <wikibugs>	 DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (jcrespo) a:Marostegui→jcrespo About to move db1102 sections to db1145.
[11:36:12] <wikibugs>	 DBA, Operations, CAS-SSO, User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond)
[11:45:32] <wikibugs>	 DBA, Operations, CAS-SSO, User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) Thanks for the detailed ticket. A few comments.  1) Let's not use `-`, if you really want that, we can go for `can_test`. 2) Can probably place this into m...
[11:47:33] <wikibugs>	 DBA, Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (Marostegui)
[11:48:00] <wikibugs>	 DBA, Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (Marostegui)
[11:48:15] <wikibugs>	 DBA, Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (Marostegui) Open→Resolved All done
[11:48:19] <wikibugs>	 DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[11:48:43] <wikibugs>	 DBA: inverse_timestamp column exists in text table, it shouldn't - https://phabricator.wikimedia.org/T250063 (Marostegui) Open→Resolved This is done as it was included at T250066
[11:48:48] <wikibugs>	 DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[11:50:52] <marostegui>	 Amir1: can you run your schema differences script?
[11:51:02] <marostegui>	 I have finished the first batch of tasks we created a few months ago
[11:51:18] <Amir1>	 Sure
[11:51:45] <Amir1>	 One thing is that with the MCR changes being merged but not deployed the list of drifts is going to the roof
[11:51:58] <marostegui>	 ah true :(
[11:52:11] <marostegui>	 can you exclude tables? like revision and archive?
[11:52:27] <marostegui>	 MCR is being deployed in s6 as we speak
[11:54:15] <Amir1>	 yeah sure
[12:06:31] <marostegui>	 kormat: for db1088 let's wait for DCOPs to come back with whether they have a BBU or not, if they don't and we have to order, let's pool it with some weight, if they do, you'd need to coordinate with them for the replacement
[12:06:47] <marostegui>	 typically just stop mysql and leave the host off the day you guys agree on
[12:07:11] <marostegui>	 I normally do that a few hours before the agreed hour and comment on the task saying that the host is ready for them to act on
[12:07:11] <kormat>	 gotcha
[12:07:25] <marostegui>	 they normally leave it back ON but with mysql stopped of course
[12:14:13] <wikibugs>	 DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[12:14:30] <wikibugs>	 DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[12:41:00] <wikibugs>	 DBA, Patch-For-Review: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1145.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006231240_jynus_20243.log`.
[12:52:55] <wikibugs>	 DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) MariaDB 10.1 doesn't allow the increase of varchar/varbinary with INPLACE alter tables, thi...
[13:03:40] <wikibugs>	 DBA, Patch-For-Review: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1145.eqiad.wmnet'] `  and were **ALL** successful.
[13:10:22] <wikibugs>	 DBA, Operations, CAS-SSO, User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond) >>! In T256120#6248144, @Marostegui wrote: > Thanks for the detailed ticket. > A few comments. >  > 1) Let's not use `-`, if you really want that, we can go for...
[13:13:03] <wikibugs>	 DBA, Operations, CAS-SSO, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Kormat)
[13:21:11] <wikibugs>	 DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[13:36:01] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Jclark-ctr) @marostegui we have spare bbu. i happen to be on site today can.  Are you available to assist?
[13:39:19] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) @Jclark-ctr - i'm available. I can power the host off now for you to do the replacement. Just let me know when it's back. Cheers.
[13:40:12] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Host is now off.
[13:46:05] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Jclark-ctr) @Kormat   bbu replaced powering up right now
[13:46:35] <kormat>	 damn that was fast :)
[13:55:27] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) @Jclark-ctr : great, thanks :)  The diagnostics are happy with the new bbu: ` root@db1088:~# hpssacli ctrl all show status  Smart Array P840 in Slot 1    Controller Status: OK    Cache Status: Not Configur...
[13:56:05] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) a:Jclark-ctr→Kormat
[14:00:19] <wikibugs>	 DBA, Operations, ops-eqiad: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Jclark-ctr) Open→Resolved
[14:01:34] <wikibugs>	 DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat)
[14:02:05] <wikibugs>	 DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Resolved→Open Re-opening for us to keep track of re-adding the host back into service.
[14:21:06] <wikibugs>	 DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) s6 eqiad progress  [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1005...
[15:19:31] <wikibugs>	 DBA, Operations, ops-codfw: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (Papaul) `  Create Dispatch: Success You have successfully submitted request SR1028000240.
[17:29:00] <wikibugs>	 DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (wkandek) Should a BBU failure cause a reboot?
[22:09:43] <wikibugs>	 DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (wkandek) HP says, the server should not reboot due to battery failure: https://support.hpe.com/hpesc/public/docDisplay?docId=mmr_kc-0126260#:~:text=POST%20Error%3A%20313%20%2D%20HPE%20Smart,other%20reasons%20for%20a%20reboot.  "...