[08:22:03] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:08] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:03] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:25] should we silence those for 24h with the DBAs out today?
[10:33:19] I don't know what does are
[10:33:24] *those
[10:34:14] although technically db2239 is mine
[10:36:05] I'm handling an UBN, will check it later
[10:39:49] actually that's wrong, the role is mine
[10:40:11] but this is a test host; volans and others were testing something, I was not involved
[10:40:20] I will downtime it, and they can handle it later
[10:41:23] db2239? not me
[10:41:33] unless I'm not recalling it :)
[10:42:02] you were involved somehow
[10:42:30] I was asked if the DBAs could set up a new dbstore, and I said as long as it didn't touch an existing one, no problem
[10:42:43] probably to test some automation or something
[10:43:02] but maybe it was someone else
[10:47:34] lmk if you want me to have a look, not sure I can help but I can try
[10:54:30] I've downtimed it for 15 days
[10:54:48] taking care of something more important right now
[10:55:29] auew
[10:55:31] *sure
[10:57:28] Updating wikireplica views `abuse_filter_action` for T378671
[12:53:34] I rebooted an-redacteddb1001 and it looks like the s8 replication is broken again. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-redacteddb1001&service=MariaDB+Replica+SQL%3A+s8 - I know that m.arostegui was working on it last week, but I didn't think that anything was still ongoing.
[12:54:57] Not urgent. I'll make a ticket and fess up that I didn't check before rebooting it.
[12:57:15] if a schema change was ongoing, you may have just increased the recovery time 10 times
[12:57:32] as it now has to revert the changes and apply them again
[14:05:47] jynus: Understood. I thought it was my chance to get an-redacted1001 rebooted for T376800 before the schema change was started again. m.arostegui said on the 7th that reverting would take around 2 days. Then applying would take another 10-12 days.
[14:07:00] my understanding is that that was being executed, but I may be wrong
[14:40:44] hey folks!
[14:41:02] define folks :-D
[14:41:24] yesterday db2217 paged (broken replication), we fixed the index on a table and left the node depooled for your final verification
[14:41:27] (no DBAs today)
[14:41:38] that answers my question then :D
[14:42:03] but you did what I would have done, so let's wait until tomorrow
[14:42:10] super thanks!
[14:44:48] ok, UBN fixed, taking a break to unstress
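The two replication breakages discussed above (s8 on an-redacteddb1001 after the reboot, and the db2217 page) both come down to a replica whose SQL thread has stopped. Below is a minimal sketch of how such a replica can be inspected, assuming a MariaDB replica reachable via pymysql; the hostname and credentials are placeholders, and this is not the channel's own tooling.

```python
# Rough illustration only: inspect a MariaDB replica's health after a reboot.
# Host and credentials are placeholders, not values from this channel.
import pymysql


def replica_health(host: str, user: str, password: str) -> dict:
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone() or {}  # empty dict if the host is not a replica
    finally:
        conn.close()
    return {
        "io_running": status.get("Slave_IO_Running"),    # should be "Yes"
        "sql_running": status.get("Slave_SQL_Running"),  # "No" => SQL thread broken
        "lag": status.get("Seconds_Behind_Master"),      # NULL while the thread is stopped
        "last_error": status.get("Last_SQL_Error"),      # e.g. a duplicate-key error
    }


if __name__ == "__main__":
    print(replica_health("db2217.example", "monitor", "secret"))
```

On multi-source replicas (one replication connection per section, as on dbstore-style hosts) the equivalent statement is SHOW ALL SLAVES STATUS, which returns one row per connection.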
[16:27:58] jynus: re: https://phabricator.wikimedia.org/T371416#10302257 - shall we set up a plan about what/how to test it?
[16:39:27] I was taking care of that, but sadly the previous UBN delayed me
[16:39:44] that was next on my list
[16:42:50] what I was doing is setting up a host with the previous hw config first, before comparing them
[16:43:04] however, I will need dc ops assistance for taking out a disk later on
[16:51:03] okok super thanks! Lemme know if you need any help
[16:53:25] I will certainly keep you updated
[16:53:34] <3
[16:53:49] (no rush, I was just curious, that's it)
[19:54:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1182:9104 has too large replication lag (11m 40s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1182&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:09:30] PROBLEM - MariaDB sustained replica lag on s2 on db1182 is CRITICAL: 1442 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
[20:19:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1182:9104 has too large replication lag (2m 5s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1182&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:20:30] RECOVERY - MariaDB sustained replica lag on s2 on db1182 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
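The MysqlReplicationLagPtHeartbeat alerts above measure lag through pt-heartbeat rather than Seconds_Behind_Master: the primary keeps writing a timestamp row, and a replica's lag is "now" minus the newest replicated timestamp. The sketch below shows that calculation in its simplest form; it is not the production alert logic, and it assumes a pt-heartbeat table at heartbeat.heartbeat with a ts column, timestamps written in UTC (pt-heartbeat --utc), and placeholder host/credentials.

```python
# Rough sketch of pt-heartbeat-based replication lag: compare the newest
# replicated heartbeat timestamp against the current time. Table location,
# UTC assumption, host and credentials are illustrative only.
from datetime import datetime, timezone

import pymysql


def pt_heartbeat_lag_seconds(host: str, user: str, password: str) -> float:
    """Return replication lag in seconds as seen through pt-heartbeat."""
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # Newest heartbeat row replicated from the primary.
            cur.execute("SELECT MAX(ts) FROM heartbeat.heartbeat")
            (ts,) = cur.fetchone()
    finally:
        conn.close()
    # pt-heartbeat writes ISO-style timestamps, e.g. 2024-11-11T19:54:48.000510
    written = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - written).total_seconds()


if __name__ == "__main__":
    lag = pt_heartbeat_lag_seconds("db1182.example", "monitor", "secret")
    print(f"replication lag: {lag:.1f}s")
```

The "MariaDB sustained replica lag" check in the log applies thresholds on top of a measurement like this: warning at 5 s and critical at 10 s of sustained lag, which is why 1442 s fired CRITICAL and 0 s recovered.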