[10:37:39] what a lot of image tickets to come back to :-/
[10:44:19] FIRING: PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:59:42] Amir1: I am going to do https://phabricator.wikimedia.org/T381993
[10:59:58] oh nice
[11:00:07] I have so many schema changes on the old master :P
[11:03:40] Amir1: You'll probably have to wait till Monday, as I will deploy the revision table schema change, which takes around 48h
[11:03:50] I will let you know once it is free anyway
[11:03:54] Going to start the switchover
[11:03:54] 💔
[11:20:37] Amir1: Done, I will start my schema change and let you know when ready
[11:20:43] I will reboot the host first for the kernel upgrade
[11:20:47] Thanks!
[11:21:45] happy new year, guys
[11:22:00] kwakuofori: likewise
[11:23:36] Wishing all the kernel panics and RAID degradations for this year
[11:24:27] XD
[11:25:59] reboot
[11:26:02] gah
[11:49:41] Amir1: if you want to restart thumb-deletions, feel free
[11:49:49] (I've been vacuuming container DBs again today)
[11:50:22] sure!
[11:50:43] the containers for 00-0f should be quite small now
[11:52:57] 0f is now only 1.3M objects (131GB), compared to 1f (6.9M objects, 941GB)
[12:26:52] Emperor: Restarted
[13:06:57] thanks
[13:35:09] oh, since we did the emergency switchover of s5, the old master is ready to take schema changes
[13:35:24] but first, lunch
[13:48:03] Amir1: Probably not needed, as I recloned it from another host (which I guess already had all of them)?
[13:48:14] So double-check in case they are already applied
[13:51:19] ah, cool
[15:07:27] sobanski: Were the hosts we discussed you'd need in codfw or eqiad?
[15:08:06] Doesn't really matter TBH
[15:08:32] If we're talking about the one for Phab
[15:08:42] Yeah, for the upgrade
[15:08:48] It was an upgrade, right?
[15:08:53] Yes
[15:09:19] I think I will do eqiad then, just in case we need to roll back
[15:09:23] Better to have the DB in eqiad
[15:10:00] If you haven't started yet, then let me confirm with Andre and Brennen that they'll be available to run the test soon
[15:10:08] Yeah, I haven't started anything
[15:10:15] So that we don't keep a host hostage for longer than necessary
[15:10:16] But I will also need a few days
[15:11:32] Good, I still need to find and allocate a host
[15:11:43] But it should be fine
[15:11:56] Thanks
[15:12:20] sobanski: is there a task for that somewhere?
[15:12:25] so I can create one subtask for us
[15:12:31] Yes, there is, let me find it
[15:12:46] https://phabricator.wikimedia.org/T370266
[15:12:52] Thank you :)
[15:13:23] You may have thoughts on what needs to be run
[15:14:42] sobanski: Yeah, I just commented
[15:15:39] PROBLEM - MariaDB sustained replica lag on s1 on db2130 is CRITICAL: 12.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104
[15:15:58] PROBLEM - MariaDB sustained replica lag on s1 on db2145 is CRITICAL: 16.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2145&var-port=9104
[15:15:59] PROBLEM - MariaDB sustained replica lag on s1 on db2216 is CRITICAL: 13.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2216&var-port=9104
[15:16:01] PROBLEM - MariaDB sustained replica lag on s1 on db2153 is CRITICAL: 16.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2153&var-port=9104
[15:16:05] PROBLEM - MariaDB sustained replica lag on s1 on db1219 is CRITICAL: 62.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104
[15:16:09] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 27.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[15:16:27] woot
[15:16:31] PROBLEM - MariaDB sustained replica lag on s1 on db1184 is CRITICAL: 36 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1184&var-port=9104
[15:16:31] Are dumps back?
[15:16:33] PROBLEM - MariaDB sustained replica lag on s1 on db1207 is CRITICAL: 69 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[15:16:33] PROBLEM - MariaDB sustained replica lag on s1 on db1196 is CRITICAL: 23 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[15:16:35] PROBLEM - MariaDB sustained replica lag on s1 on db1169 is CRITICAL: 19.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1169&var-port=9104
[15:16:35] PROBLEM - MariaDB sustained replica lag on s1 on db1186 is CRITICAL: 12.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[15:16:41] RECOVERY - MariaDB sustained replica lag on s1 on db2130 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104
[15:16:43] That shouldn't be it, as it is affecting all of them
[15:16:55] PROBLEM - MariaDB sustained replica lag on s1 on db1218 is CRITICAL: 30.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[15:17:00] But it seems like a small spike, as orchestrator looks good
[15:17:01] RECOVERY - MariaDB sustained replica lag on s1 on db2153 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2153&var-port=9104
[15:17:01] RECOVERY - MariaDB sustained replica lag on s1 on db2216 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2216&var-port=9104
[15:17:05] PROBLEM - MariaDB sustained replica lag on s1 on db1235 is CRITICAL: 42 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[15:17:09] PROBLEM - MariaDB sustained replica lag on s1 on db1195 is CRITICAL: 16.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1195&var-port=9104
[15:17:11] PROBLEM - MariaDB sustained replica lag on s1 on db1234 is CRITICAL: 82.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1234&var-port=9104
[15:17:13] PROBLEM - MariaDB sustained replica lag on s1 on db1232 is CRITICAL: 36.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[15:17:35] RECOVERY - MariaDB sustained replica lag on s1 on db1186 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[15:17:59] RECOVERY - MariaDB sustained replica lag on s1 on db2145 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2145&var-port=9104
[15:18:09] RECOVERY - MariaDB sustained replica lag on s1 on db1195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1195&var-port=9104
[15:18:13] RECOVERY - MariaDB sustained replica lag on s1 on db1232 is OK: (C)10 ge (W)5 ge 4.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[15:18:31] RECOVERY - MariaDB sustained replica lag on s1 on db1184 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1184&var-port=9104
[15:18:33] RECOVERY - MariaDB sustained replica lag on s1 on db1196 is OK: (C)10 ge (W)5 ge 3.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[15:18:35] RECOVERY - MariaDB sustained replica lag on s1 on db1169 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1169&var-port=9104
[15:18:46] Looks like the master had a spike on writes
[15:18:55] RECOVERY - MariaDB sustained replica lag on s1 on db1218 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[15:19:05] RECOVERY - MariaDB sustained replica lag on s1 on db1219 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104
[15:19:05] RECOVERY - MariaDB sustained replica lag on s1 on db1235 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[15:19:09] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[15:19:27] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db2203&var-port=9104&from=1735829241209&to=1735829767100&viewPanel=12
[15:19:33] RECOVERY - MariaDB sustained replica lag on s1 on db1207 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[15:20:11] RECOVERY - MariaDB sustained replica lag on s1 on db1234 is OK: (C)10 ge (W)5 ge 1.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1234&var-port=9104
[15:28:30] Amir1: let me know what you think of https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/merge_requests/7
[15:32:22] marostegui: I'm happy with the change!
[15:32:36] Actually the write spike was me. There wasn't any way around it. Sorry
[15:32:44] but it was a one-off
[15:33:02] * marostegui stares at Amir1
[15:33:10] Amir1: I will merge then!
[15:33:36] Let me know if you don't have merge rights
[15:33:50] sorry about the writes
[15:34:02] Amir1: I do. Can you re-run the switchmaster service so it picks up the new template? It doesn't have to be now anyway
[15:34:26] sure
[15:35:11] Amir1: Better to know it was you!
[15:35:18] https://www.irccloud.com/pastebin/yuXvHJRs/
[15:35:25] done ^
[15:35:31] Thanks!
[15:38:05] Amir1: it looks good https://phabricator.wikimedia.org/T382900
[15:40:13] The mess was caused by deleting only 3k rows: https://phabricator.wikimedia.org/T54778#10426991 - the problem was that those rows were hidden among 3M rows and there was no index on them
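
For context on that last message: the usual way to make such a delete cheap is to add an index on the filtered column and remove rows in small batches, pausing between batches so replicas keep up. Below is a minimal sketch of that pattern in Python with pymysql; the host, table, and column names (job_queue, expiry) are hypothetical, and this is not the actual script from T54778.

    # Sketch of a batched delete that avoids the replication lag seen above.
    # All names here are hypothetical; adjust for the real schema.
    import time

    import pymysql

    BATCH = 1000   # small batches keep each transaction short
    PAUSE = 0.5    # seconds to let replicas apply each batch

    conn = pymysql.connect(host="db-host.example.org", user="maint",
                           password="secret", database="exampledb")
    try:
        # Without an index on `expiry`, every DELETE below would scan all
        # ~3M rows to find the ~3k matching ones, holding locks the whole
        # time; with the index it touches only the matching rows.
        with conn.cursor() as cur:
            cur.execute("CREATE INDEX exp_idx ON job_queue (expiry)")
        conn.commit()

        while True:
            with conn.cursor() as cur:
                # LIMIT bounds the transaction size; execute() returns the
                # number of affected rows.
                deleted = cur.execute(
                    "DELETE FROM job_queue WHERE expiry < NOW() LIMIT %s",
                    (BATCH,),
                )
            conn.commit()
            if deleted < BATCH:  # partial batch means nothing is left
                break
            time.sleep(PAUSE)    # throttle so replicas don't fall behind
    finally:
        conn.close()

One caveat with this pattern: DELETE ... LIMIT without an ORDER BY is non-deterministic, which matters under statement-based replication, and production maintenance scripts typically wait on measured replica lag rather than a fixed sleep.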