[08:23:45] swfrench-wmf: https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Tables%2Findex_corruption
[08:24:20] I think in practice it's OK to just depool and open a task, but the cleanup instructions got moved to wikitech
[08:30:43] and the DBAs are working on upgrading to a version where we should stop seeing this bug too
[08:37:19] Sorry, I missed scott's comment as my bouncer has been acting funny
[08:38:02] swfrench-wmf: It was moved to wikitech, as Emperor said, because it is taking more than 3 months (is it 3 months when we move them to wikitech, sobanski?). We are actively working on it. Depooling and opening a task is good enough
[08:38:19] We are now proactively upgrading sections and rebuilding tables to minimize the chances of them happening randomly
[08:39:05] Yes, 3 months
[08:40:30] But it’s a good point, I’ll respond to the original thread and add it to the instructions
[09:03:59] Going to change es1 master
[09:09:47] PROBLEM - MariaDB sustained replica lag on es6 on es2036 is CRITICAL: 92.75 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2036&var-port=9104
[09:10:25] FIRING: SystemdUnitFailed: ferm.service on es2036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:49] RECOVERY - MariaDB sustained replica lag on es6 on es2036 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2036&var-port=9104
[09:15:25] RESOLVED: SystemdUnitFailed: ferm.service on es2036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:55] FIRING: [2x] SystemdUnitFailed: ferm.service on es2030:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:22] ^ All those are part of the upgrades
[09:27:55] RESOLVED: [2x] SystemdUnitFailed: ferm.service on es2030:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:46] db1150 tables are being rebuilt
[12:28:25] I love change master to master_delay
[12:28:29] so useful for testing
[12:29:56] I'm curious now
[12:30:33] Nah, I was scripting the rebuild-table script to wait for lag before repooling
[13:26:48] PROBLEM - MariaDB sustained replica lag on s3 on db1166 is CRITICAL: 2798 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[13:35:48] RECOVERY - MariaDB sustained replica lag on s3 on db1166 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[14:20:53] should that be converted to a cookbook? :-P /me hides
[14:22:31] "patches welcome" ;p
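(Editorial aside: the "wait for lag before repooling" step mentioned at 12:30 can be sketched roughly as below. This is a minimal illustration, not the actual rebuild script; the host name, thresholds, and the use of pymysql with ~/.my.cnf credentials are assumptions. The master_delay trick praised at 12:28 is handy for testing logic like this because STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 300; START SLAVE; produces replication lag on demand.)

    # Illustrative sketch only: poll a replica until Seconds_Behind_Master
    # drops below a threshold, after which it would be safe to repool it.
    # Host, credentials and thresholds are assumptions, not production code.
    import time
    import pymysql

    def wait_for_replica_lag(host, max_lag=5, poll_interval=10):
        """Block until replication lag on `host` is at most `max_lag` seconds."""
        conn = pymysql.connect(host=host, read_default_file="~/.my.cnf",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            while True:
                with conn.cursor() as cur:
                    cur.execute("SHOW SLAVE STATUS")
                    status = cur.fetchone()
                lag = status.get("Seconds_Behind_Master") if status else None
                # lag is None while the replication threads are stopped or reconnecting
                if lag is not None and lag <= max_lag:
                    return
                time.sleep(poll_interval)
        finally:
            conn.close()

    # e.g. after rebuilding tables on a depooled replica:
    wait_for_replica_lag("db1166.eqiad.wmnet", max_lag=5)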
[14:37:16] Hi DBAs! Seems like this schema change from last week wasn't fully applied in production: https://phabricator.wikimedia.org/T381759#10500758 I'm really confused and don't know why that might be
[14:39:00] (ping marostegui)
[14:48:09] Emperor: marostegui: great, thank you for clarifying! :)
[15:21:39] I'm on it Daimona
[15:21:48] Thank you!
[15:41:12] Daimona: should be fixed now
[15:42:47] Nice, thanks! I'm wondering what happened, I looked at it for a while but couldn't figure it out.
[15:43:40] Daimona: A stupid thing: the long-time x1 master has always been db2191, and it is now db2196, so I selected db2191, and when I double checked I guess my memory stopped reading at db219 and missed that last digit
[15:45:23] Oh, I see. I considered that, it just seemed odd that there was no alert in place (either preventing the change, or reporting a drift after the fact).
[15:45:46] But I guess there might be valid reasons as to why that doesn't exist
[15:46:28] Daimona: We are working on the schema change automatic reporting yeah
[15:46:32] Like drifts
[15:47:05] Ahhhhh the table catalog, is it?
[15:48:45] Yeah, that's part of it too
[15:48:52] But it is not as easy as it looks unfortunately
[15:48:58] Many tables, many moving pieces etc
[15:50:47] Yep, yep, I imagine. I also saw reports of random test tables, in random wikis, created ages ago for no obvious reason, that need to be cleaned up. Not fun. But for the time being, thank you for saving the day :)
[15:51:21] Daimona: I remember when we started to fix the drifts on the revision table indexes and PKs, I think it took 2 years to get it completely fixed everywhere, oh well!
[15:51:29] Daimona: Thanks for the ping and sorry for blocking the train!
[15:53:11] Which is one of many, many reasons to be grateful for the amazing work y'all are doing! And no worries re train, I don't even know if anybody else noticed :)
[15:53:29] That's good then :)
[16:28:51] dhinus: When do you plan to reboot clouddb* hosts for https://phabricator.wikimedia.org/T376905? If you could just run apt full-upgrade, that'd also upgrade mariadb to 10.6.20
[16:42:16] marostegui: ack, I'll upgrade as I reboot. I forgot about that task, I will aim for this week
[16:42:46] wait, clouddbs are not listed in that task though, are they affected?
[16:43:14] dhinus: Are they bookworm? That task is only for hosts owned by us
[16:43:24] root@clouddb1015:~# uname -v
[16:43:24] I think they are yes
[16:43:24] #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26)
[16:43:29] yeah, just checked
[16:44:06] dhinus: https://phabricator.wikimedia.org/T376800
[16:44:19] there you go, thanks
[16:45:48] 'P{clouddb*} and A:bookworm'
[16:45:52] clouddb[1013-1020].eqiad.wmnet
[18:59:31] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:02:07] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
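(Editorial aside: the drift reporting discussed around 15:45-15:48 amounts to comparing table definitions across the hosts of a section. A minimal sketch of that idea follows, assuming pymysql and ~/.my.cnf credentials; the hosts are the x1 pair named in the chat above, while the database and table are placeholders, not the objects from T381759, and this is not the DBAs' actual tooling.)

    # Hypothetical drift check: compare SHOW CREATE TABLE on the current
    # master against each replica. Hosts, database and table are
    # illustrative assumptions only.
    import re
    import pymysql

    def show_create(host, db, table):
        conn = pymysql.connect(host=host, database=db, read_default_file="~/.my.cnf")
        try:
            with conn.cursor() as cur:
                cur.execute(f"SHOW CREATE TABLE `{table}`")
                ddl = cur.fetchone()[1]  # row is (table_name, create_statement)
            # strip the AUTO_INCREMENT counter, which legitimately differs per host
            return re.sub(r" AUTO_INCREMENT=\d+", "", ddl)
        finally:
            conn.close()

    master = "db2196.codfw.wmnet"        # current x1 master, per the chat above
    replicas = ["db2191.codfw.wmnet"]    # former long-time master, now a replica
    db, table = "wikishared", "cx_translations"   # placeholder objects

    reference = show_create(master, db, table)
    for replica in replicas:
        if show_create(replica, db, table) != reference:
            print(f"DRIFT: {db}.{table} differs between {master} and {replica}")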