[04:36:28] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2166:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:28] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2166:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:28] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2166:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:48] FIRING: MysqlReplicationLag: MySQL instance db2166:9104@s8 has too large replication lag (18h 24m 53s). Its replication source is db2165.codfw.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[06:49:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2166:9104 has too large replication lag (18h 24m 53s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[07:35:15] Depooling pc1 for maintenance
[07:37:02] pc1011:~# uptime
[07:37:02] 07:36:56 up 409 days,
[07:37:04] It is about time
[07:37:19] and pc2011 is 357 days
[08:25:19] Going for pc2 now
[08:27:57] volans: Is this a bug? https://phabricator.wikimedia.org/P72243
[09:19:48] RESOLVED: MysqlReplicationLag: MySQL instance db2166:9104@s8 has too large replication lag (3m 14s). Its replication source is db2165.codfw.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:19:48] RESOLVED: MysqlReplicationLagPtHeartbeat: MySQL instance db2166:9104 has too large replication lag (3m 14s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[09:20:45] looking
[09:24:12] marostegui: it got executed with 'pc2012' not 'pc2012*' and the downtime cookbook wants a cumin query. I noticed you didn't have quotes in your command. By any chance do you have a file named "pc2012" in the directory from where you launched the cookbook?
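
A minimal shell sketch of the quoting pitfall being discussed here; the cookbook name, directory, and pattern are illustrative, not the command that was actually run:

  # An unquoted glob that happens to match a file in the working directory is
  # expanded by bash before the cookbook ever sees it, so cumin receives a
  # plain hostname instead of the intended 'pc2012*' query.
  mkdir -p /tmp/globdemo && cd /tmp/globdemo
  touch pc2012                                # stray file matching the host prefix
  echo cookbook sre.hosts.downtime pc2012*    # prints: cookbook sre.hosts.downtime pc2012
  echo cookbook sre.hosts.downtime 'pc2012*'  # prints: cookbook sre.hosts.downtime pc2012*

Quoting (or escaping) the glob keeps the cumin query intact no matter what files sit in the directory the cookbook is launched from.
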
[09:24:29] in that case bash would have resolved the pc2012* to pc2012 and hence the cookbook failing
[09:29:14] ha, that's an interesting one
[09:29:31] $ file pc2012
[09:29:31] pc2012: ASCII text, with very long lines
[09:29:35] eheheh
[09:29:46] I don't have pc1012 file
[09:29:47] bash globbing got you
[09:29:52] hahaha good one
[10:26:49] downtimed ms-be2075 for another week :-/ T382707
[10:26:50] T382707: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707
[12:14:22] Amir1: I pick https://phabricator.wikimedia.org/T384592 and you do the table creation?
[12:14:50] sure!
[12:14:57] Good!
[12:15:03] Thanks!
[12:15:04] Will send the gitlab merge request in a bit
[12:28:09] btw, the table is quite tiny in s8:
[12:28:12] https://www.irccloud.com/pastebin/DAj2jWBk/
[12:28:29] you can run it with replication or use the live option
[12:29:08] Ah, nice, I will do it directly on the master
[13:01:53] Emperor: I no longer ping you when I do a backup speedup (e.g. after maintenance) because I believe you (or the metrics) barely notice it
[13:02:10] as I do it with very low concurrency
[13:10:42] moving the convo from operations here. alerts on prometheus seem to be currently using seconds behind master
[13:11:42] flagging it in the context of T321808
[13:11:43] T321808: Port all Icinga checks to Prometheus/Alertmanager - https://phabricator.wikimedia.org/T321808
[13:58:34] as clinic duty backlog seems to have almost cleared, going for a coffee while db1239 gets reimaged
[13:59:08] I think manuel will be happy with the resolution of T366092
[13:59:09] T366092: Upgrade eqiad mediabackups database hosts to Debian Bookworm - https://phabricator.wikimedia.org/T366092
[13:59:20] I definitely am!
[13:59:24] Thank you :*
[16:11:28] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:28] ^fixed
[16:26:25] Switching s2 eqiad master
[16:32:42] Damn semi sync
[16:33:03] ?
[16:33:14] It usually causes hangs when doing all the switchover things
[16:33:38] I am so tired of it
[16:33:53] It leaves the host so stuck that the only way to recover is to actually kill mariadb
[16:33:56] It is terrible
[16:34:13] I've not been able to reproduce it all the time to send a proper bug
[16:34:50] at what step does it happen?
[16:35:00] It is very random
[16:35:23] It happened after a few successful moves
[16:36:58] I think in general 10.6 has some reliability problems. More features but also more issues.
[16:42:45] jynus: I will do the s2 eqiad master switch and when db1239 is up to date (tomorrow probably) I will move it under the new master, so nothing for you to do there. I will take care of it
[16:45:06] table rebuilding is ongoing
[16:45:13] Yeah no worries
[16:50:16] I am checking and backup sources have binlog enabled
[16:50:37] I cannot remember if that was changed, or it was always like that
[16:50:43] I don't remember
[16:50:45] but we could leave those on ROW
[16:50:51] yeah, +1 for that
[16:51:08] so I can backup those binlogs, as they are unlikely to be set up as masters
[16:51:22] as well as the direct STATEMENT copy from the primary
[16:51:46] yeah
[16:51:59] And we can have 3
[16:52:07] 3+ months of binlogs on dbprovs
[16:52:34] as, as I demonstrated on a bet to you, compressed they will take almost no space
[16:52:39] :-D
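
A hedged sketch of how the semi-sync state mentioned in the switchover discussion can be inspected on a stock MariaDB primary; the client invocation is illustrative and this is not the switchover tooling itself:

  # Semi-sync configuration and runtime counters on the primary.
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master%'"
  mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%'"
  # Sessions blocked waiting for a semi-sync ACK show up in the process list;
  # disabling semi-sync is worth trying before killing mariadb, though it may
  # also hang if the server is already wedged, as reported above.
  mysql -e "SHOW PROCESSLIST"
  mysql -e "SET GLOBAL rpl_semi_sync_master_enabled = OFF"
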
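For the binlog discussion at the end, a sketch of the checks involved; the variables are stock MariaDB, while the datadir path, compressor, and retention figure are assumptions for illustration:

  # Confirm binary logging and the current format on a backup source.
  mysql -e "SELECT @@log_bin, @@binlog_format, @@expire_logs_days\G"
  # Leaving the backup sources on ROW would normally be done in puppet;
  # the runtime equivalent of that setting is:
  mysql -e "SET GLOBAL binlog_format = 'ROW'"
  # Rough compression-ratio check before committing to 3+ months of binlog
  # retention on the dbprov hosts (the path below is an assumption).
  BINLOG=$(mysql -Nse "SHOW BINARY LOGS" | awk 'NR==1 {print $1}')
  ls -l "/srv/sqldata/$BINLOG"
  zstd -q -c "/srv/sqldata/$BINLOG" | wc -c   # compressed size in bytes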