[10:37:39] what a lot of image tickets to come back to :-/
[10:44:19] FIRING: PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:59:42] Amir1: I am going to do https://phabricator.wikimedia.org/T381993
[10:59:58] oh nice
[11:00:07] I have so many schema changes on the old master :P
[11:03:40] Amir1: You'll probably have to wait till Monday, as I will deploy the revision table schema change, which takes around 48h
[11:03:50] I will let you know once it is free anyway
[11:03:54] Going to start the switchover
[11:03:54] 💔
[11:20:37] Amir1: Done, I will start my schema change and let you know when ready
[11:20:43] I will reboot the host first for the kernel upgrade
[11:20:47] Thanks!
[11:21:45] happy new year, guys
[11:22:00] kwakuofori: likewise
[11:23:36] Wishing all the kernel panics and RAID degradations for this year
[11:24:27] XD
[11:25:59] reboot
[11:26:02] gah
[11:49:41] Amir1: if you want to restart thumb-deletions, feel free
[11:49:49] (I've been vacuuming container DBs again today)
[11:50:22] sure!
[11:50:43] the containers for 00-0f should be quite small now
[11:52:57] 0f is now only 1.3M objects (131GB), compared to 1f (6.9M objects, 941GB)
[12:26:52] Emperor: Restarted
[13:06:57] thanks
[13:35:09] oh, since we did the emergency switchover of s5, the old master is ready to take schema changes
[13:35:24] but first, lunch
[13:48:03] Amir1: Probably not needed, as I recloned it from another host (which I guess already had all of them)?
[13:48:14] So double-check in case they are already applied
[13:51:19] ah, cool
[15:07:27] sobanski: Were the hosts we discussed you'd need in codfw or eqiad?
[15:08:06] Doesn't really matter TBH
[15:08:32] If we're talking about the one for Phab
[15:08:42] Yeah, for the upgrade
[15:08:48] It was an upgrade, right?
[15:08:53] Yes
[15:09:19] I think I will do eqiad then, just in case we need to roll back
[15:09:23] Better to have the DB in eqiad
[15:10:00] If you haven't started yet, then let me confirm with Andre and Brennen that they'll be available to run the test soon
[15:10:08] Yeah, I haven't started anything
[15:10:15] So that we don't keep a host hostage for longer than necessary
[15:10:16] But I will also need a few days
[15:11:32] Good, I still need to find and allocate a host
[15:11:43] But it should be fine
[15:11:56] Thanks
[15:12:20] sobanski: is there a task for that somewhere?
[15:12:25] so I can create one subtask for us
[15:12:31] Yes, there is, let me find it
[15:12:46] https://phabricator.wikimedia.org/T370266
[15:12:52] Thank you :)
[15:13:23] You may have thoughts on what needs to be run
[15:14:42] sobanski: Yeah, I just commented
[15:15:39] PROBLEM - MariaDB sustained replica lag on s1 on db2130 is CRITICAL: 12.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104
[15:15:58] PROBLEM - MariaDB sustained replica lag on s1 on db2145 is CRITICAL: 16.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2145&var-port=9104
[15:15:59] PROBLEM - MariaDB sustained replica lag on s1 on db2216 is CRITICAL: 13.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2216&var-port=9104
[15:16:01] PROBLEM - MariaDB sustained replica lag on s1 on db2153 is CRITICAL: 16.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2153&var-port=9104
[15:16:05] PROBLEM - MariaDB sustained replica lag on s1 on db1219 is CRITICAL: 62.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104
[15:16:09] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 27.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[15:16:27] woot
[15:16:31] PROBLEM - MariaDB sustained replica lag on s1 on db1184 is CRITICAL: 36 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1184&var-port=9104
[15:16:31] Are dumps back?
[15:16:33] PROBLEM - MariaDB sustained replica lag on s1 on db1207 is CRITICAL: 69 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[15:16:33] PROBLEM - MariaDB sustained replica lag on s1 on db1196 is CRITICAL: 23 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[15:16:35] PROBLEM - MariaDB sustained replica lag on s1 on db1169 is CRITICAL: 19.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1169&var-port=9104
[15:16:35] PROBLEM - MariaDB sustained replica lag on s1 on db1186 is CRITICAL: 12.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[15:16:41] RECOVERY - MariaDB sustained replica lag on s1 on db2130 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104
[15:16:43] That shouldn't be it, as it is affecting all of them
[15:16:55] PROBLEM - MariaDB sustained replica lag on s1 on db1218 is CRITICAL: 30.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[15:17:00] But it seems like a small spike, as orchestrator looks good
[15:17:01] RECOVERY - MariaDB sustained replica lag on s1 on db2153 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2153&var-port=9104
[15:17:01] RECOVERY - MariaDB sustained replica lag on s1 on db2216 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2216&var-port=9104
[15:17:05] PROBLEM - MariaDB sustained replica lag on s1 on db1235 is CRITICAL: 42 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[15:17:09] PROBLEM - MariaDB sustained replica lag on s1 on db1195 is CRITICAL: 16.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1195&var-port=9104
[15:17:11] PROBLEM - MariaDB sustained replica lag on s1 on db1234 is CRITICAL: 82.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1234&var-port=9104
[15:17:13] PROBLEM - MariaDB sustained replica lag on s1 on db1232 is CRITICAL: 36.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[15:17:35] RECOVERY - MariaDB sustained replica lag on s1 on db1186 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[15:17:59] RECOVERY - MariaDB sustained replica lag on s1 on db2145 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2145&var-port=9104
[15:18:09] RECOVERY - MariaDB sustained replica lag on s1 on db1195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1195&var-port=9104
[15:18:13] RECOVERY - MariaDB sustained replica lag on s1 on db1232 is OK: (C)10 ge (W)5 ge 4.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[15:18:31] RECOVERY - MariaDB sustained replica lag on s1 on db1184 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1184&var-port=9104
[15:18:33] RECOVERY - MariaDB sustained replica lag on s1 on db1196 is OK: (C)10 ge (W)5 ge 3.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[15:18:35] RECOVERY - MariaDB sustained replica lag on s1 on db1169 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1169&var-port=9104
[15:18:46] Looks like the master had a spike on writes
[15:18:55] RECOVERY - MariaDB sustained replica lag on s1 on db1218 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[15:19:05] RECOVERY - MariaDB sustained replica lag on s1 on db1219 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104
[15:19:05] RECOVERY - MariaDB sustained replica lag on s1 on db1235 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[15:19:09] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[15:19:27] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db2203&var-port=9104&from=1735829241209&to=1735829767100&viewPanel=12
[15:19:33] RECOVERY - MariaDB sustained replica lag on s1 on db1207 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[15:20:11] RECOVERY - MariaDB sustained replica lag on s1 on db1234 is OK: (C)10 ge (W)5 ge 1.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1234&var-port=9104
[15:28:30] Amir1: let me know what you think of https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/merge_requests/7
[15:32:22] marostegui: I'm happy with the change!
[15:32:36] Actually the write spike was me. There wasn't any way around it. Sorry
[15:32:44] but it was a one-off
[15:33:02] * marostegui stares at Amir1
[15:33:10] Amir1: I will merge then!
[15:33:36] Let me know if you don't have merge rights
[15:33:50] sorry about the writes
[15:34:02] Amir1: I do. Can you re-run the switchmaster service so it picks up the new template? It doesn't have to be now anyway
[15:34:26] sure
[15:35:11] Amir1: Better to know it was you!
[15:35:18] https://www.irccloud.com/pastebin/yuXvHJRs/
[15:35:25] done ^
[15:35:31] Thanks!
[15:38:05] Amir1: it looks good https://phabricator.wikimedia.org/T382900
[15:40:13] The mess was caused by deleting only 3k rows: https://phabricator.wikimedia.org/T54778#10426991 - the problem was that those rows were hidden among 3M rows and there was no index on them
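
For context on that last message: the usual way to make such a delete cheap is to add an index on the filtered column and remove rows in small batches, pausing between batches so replicas keep up. Below is a minimal sketch of that pattern in Python with pymysql; the host, table, and column names (job_queue, expiry) are hypothetical, and this is not the actual script from T54778.

    # Sketch of a batched delete that avoids the replication lag seen above.
    # All names here are hypothetical; adjust for the real schema.
    import time

    import pymysql

    BATCH = 1000   # small batches keep each transaction short
    PAUSE = 0.5    # seconds to let replicas apply each batch

    conn = pymysql.connect(host="db-host.example.org", user="maint",
                           password="secret", database="exampledb")
    try:
        # Without an index on `expiry`, every DELETE below would scan all
        # ~3M rows to find the ~3k matching ones, holding locks the whole
        # time; with the index it touches only the matching rows.
        with conn.cursor() as cur:
            cur.execute("CREATE INDEX exp_idx ON job_queue (expiry)")
        conn.commit()

        while True:
            with conn.cursor() as cur:
                # LIMIT bounds the transaction size; execute() returns the
                # number of affected rows.
                deleted = cur.execute(
                    "DELETE FROM job_queue WHERE expiry < NOW() LIMIT %s",
                    (BATCH,),
                )
            conn.commit()
            if deleted < BATCH:  # partial batch means nothing is left
                break
            time.sleep(PAUSE)    # throttle so replicas don't fall behind
    finally:
        conn.close()

One caveat with this pattern: DELETE ... LIMIT without an ORDER BY is non-deterministic, which matters under statement-based replication, and production maintenance scripts typically wait on measured replica lag rather than a fixed sleep.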