[04:36:28] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2166:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:28] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2166:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:28] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2166:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:48] FIRING: MysqlReplicationLag: MySQL instance db2166:9104@s8 has too large replication lag (18h 24m 53s). Its replication source is db2165.codfw.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[06:49:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2166:9104 has too large replication lag (18h 24m 53s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[07:35:15] Depooling pc1 for maintenance
[07:37:02] pc1011:~# uptime
[07:37:02] 07:36:56 up 409 days,
[07:37:04] It is about time
[07:37:19] and pc2011 is 357 days
[08:25:19] Going for pc2 now
[08:27:57] volans: Is this a bug? https://phabricator.wikimedia.org/P72243
[09:19:48] RESOLVED: MysqlReplicationLag: MySQL instance db2166:9104@s8 has too large replication lag (3m 14s). Its replication source is db2165.codfw.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:19:48] RESOLVED: MysqlReplicationLagPtHeartbeat: MySQL instance db2166:9104 has too large replication lag (3m 14s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[09:20:45] looking
[09:24:12] marostegui: it got executed with 'pc2012' not 'pc2012*' and the downtime cookbook wants a cumin query. I noticed you didn't have quotes in your command. By any chance do you have a file named "pc2012" in the directory from where you launched the cookbook?
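
A minimal shell sketch of the quoting pitfall being discussed here; the cookbook name, directory, and pattern are illustrative, not the command that was actually run:

  # An unquoted glob that happens to match a file in the working directory is
  # expanded by bash before the cookbook ever sees it, so cumin receives a
  # plain hostname instead of the intended 'pc2012*' query.
  mkdir -p /tmp/globdemo && cd /tmp/globdemo
  touch pc2012                                # stray file matching the host prefix
  echo cookbook sre.hosts.downtime pc2012*    # prints: cookbook sre.hosts.downtime pc2012
  echo cookbook sre.hosts.downtime 'pc2012*'  # prints: cookbook sre.hosts.downtime pc2012*

Quoting (or escaping) the glob keeps the cumin query intact no matter what files sit in the directory the cookbook is launched from.
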
[09:24:29] in that case bash would have resolved the pc2012* to pc2012 and hence the cookbook failing
[09:29:14] ha, that's an interesting one
[09:29:31] $ file pc2012
[09:29:31] pc2012: ASCII text, with very long lines
[09:29:35] eheheh
[09:29:46] I don't have pc1012 file
[09:29:47] bash globbing got you
[09:29:52] hahaha good one
[10:26:49] downtimed ms-be2075 for another week :-/ T382707
[10:26:50] T382707: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707
[12:14:22] Amir1: I pick https://phabricator.wikimedia.org/T384592 and you do the table creation?
[12:14:50] sure!
[12:14:57] Good!
[12:15:03] Thanks!
[12:15:04] Will send the gitlab merge request in a bit
[12:28:09] btw, the table is quite tiny in s8:
[12:28:12] https://www.irccloud.com/pastebin/DAj2jWBk/
[12:28:29] you can run it with replication or use the live option
[12:29:08] Ah, nice, I will do it directly on the master
[13:01:53] Emperor: I no longer ping you when I do a backup speedup (e.g. after maintenance) because I believe you (or the metrics) barely notice it
[13:02:10] as I do it with very low concurrency
[13:10:42] moving the convo from operations here. alerts on prometheus seem to be currently using seconds behind master
[13:11:42] flagging it in the context of T321808
[13:11:43] T321808: Port all Icinga checks to Prometheus/Alertmanager - https://phabricator.wikimedia.org/T321808
[13:58:34] as clinic duty backlog seems to have almost cleared, going for a coffee while db1239 gets reimaged
[13:59:08] I think manuel will be happy with the resolution of T366092
[13:59:09] T366092: Upgrade eqiad mediabackups database hosts to Debian Bookworm - https://phabricator.wikimedia.org/T366092
[13:59:20] I definitely am!
[13:59:24] Thank you :*
[16:11:28] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:28] ^fixed
[16:26:25] Switching s2 eqiad master
[16:32:42] Damn semi sync
[16:33:03] ?
[16:33:14] It usually causes hangs when doing all the switchover things
[16:33:38] I am so tired of it
[16:33:53] It leaves the host so stuck that the only way to recover is to actually kill mariadb
[16:33:56] It is terrible
[16:34:13] I've not been able to reproduce it all the time to send a proper bug
[16:34:50] at what step does it happen?
[16:35:00] It is very random
[16:35:23] It happened after a few successful moves
[16:36:58] I think in general 10.6 has some reliability problems. More features but also more issues.
[16:42:45] jynus: I will do the s2 eqiad master switch and when db1239 is up to date (tomorrow probably) I will move it under the new master, so nothing for you to do there. I will take care of it
[16:45:06] table rebuilding is ongoing
[16:45:13] Yeah no worries
[16:50:16] I am checking and backup sources have binlog enabled
[16:50:37] I cannot remember if that was changed, or it was always like that
[16:50:43] I don't remember
[16:50:45] but we could leave those on ROW
[16:50:51] yeah, +1 for that
[16:51:08] so I can backup those binlogs, as they are unlikely to be set up as masters
[16:51:22] as well as the direct STATEMENT copy from the primary
[16:51:46] yeah
[16:51:59] And we can have 3
[16:52:07] 3+ months of binlogs on dbprovs
[16:52:34] as, as I demonstrated on a bet to you, compressed they will take almost no space
[16:52:39] :-D
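
A hedged sketch of how the semi-sync state mentioned in the switchover discussion can be inspected on a stock MariaDB primary; the client invocation is illustrative and this is not the switchover tooling itself:

  # Semi-sync configuration and runtime counters on the primary.
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master%'"
  mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%'"
  # Sessions blocked waiting for a semi-sync ACK show up in the process list;
  # disabling semi-sync is worth trying before killing mariadb, though it may
  # also hang if the server is already wedged, as reported above.
  mysql -e "SHOW PROCESSLIST"
  mysql -e "SET GLOBAL rpl_semi_sync_master_enabled = OFF"
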
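For the binlog discussion at the end, a sketch of the checks involved; the variables are stock MariaDB, while the datadir path, compressor, and retention figure are assumptions for illustration:

  # Confirm binary logging and the current format on a backup source.
  mysql -e "SELECT @@log_bin, @@binlog_format, @@expire_logs_days\G"
  # Leaving the backup sources on ROW would normally be done in puppet;
  # the runtime equivalent of that setting is:
  mysql -e "SET GLOBAL binlog_format = 'ROW'"
  # Rough compression-ratio check before committing to 3+ months of binlog
  # retention on the dbprov hosts (the path below is an assumption).
  BINLOG=$(mysql -Nse "SHOW BINARY LOGS" | awk 'NR==1 {print $1}')
  ls -l "/srv/sqldata/$BINLOG"
  zstd -q -c "/srv/sqldata/$BINLOG" | wc -c   # compressed size in bytes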