[00:23:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:54] ^fixed
[07:13:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:45] ^ there seems to be some lag, the alert is not even on the dashboard anymore
[07:23:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:58] pc2012 is back, but replication has broken badly on pc1012; I am fixing it
[10:11:52] upgrading db2184 to mariadb 10.11
[10:12:06] (backup1-codfw replica)
[11:01:31] Amir1: does this look good to you? https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/jobs/541314
[11:18:32] es3 and es4 backups are about to finish, leaving only es5 for the weekend
[11:26:10] es2046 complaints seem to be due to the dump
[11:29:45] I think 10.11 makes backups (dumps) more performant, which would be ok, except that means more resources consumed
[11:31:08] I don't see it as bad ATM, plus those are old revisions; but it is something to keep an eye on, to tune the new backups for es6 & es7
[11:32:14] it is also because es2045 is down, so it is getting all production & dump load
[11:36:10] I commented on the ticket
[12:17:18] have a nice weekend
[12:44:15] FIRING: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (430.8m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[13:00:34] marostegui: do you have a recommendation around https://phabricator.wikimedia.org/T397453 ? (pooling back sooner vs monitoring the host)
[13:39:15] RESOLVED: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (410m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[13:48:27] Amir1: maybe you have a take on ^^^ ?
[14:05:56] <_joe_> I'm not a dba, but I think it makes sense to put it back in rotation
[14:06:27] <_joe_> we can send an email to sre-at-large@ with instructions to depool it in case of another crash over the weekend
[14:49:44] I asked infra foundation if they are tracking statistics on firmware+kernel combos reliability
[14:50:41] ...they are not, so we can decide independently
[15:08:14] FIRING: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (422.1m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:13:15] RESOLVED: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (400.1m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:18:37] as es2046 is unhappy with the io traffic, I think we can just repool 2045 right now
[15:19:12] I'll keep an eye out for pages during the rest of the day
[15:25:43] <_joe_> cdanis, swfrench-wmf ^^ in relation to the es failure of this morning
[15:26:16] ack, thanks!
[16:23:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:31] a lot of slow replication
[20:23:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:58] no, 6 db* are not replicating
[20:38:27] ...and they all seem to be backup sources, perhaps running a backup workload now...?
[22:10:07] federico3: backup sources stop replication intentionally to basically have a snapshot. That's intentional
[22:11:02] Amir1: that's what I was thinking, seeing all the dbs start doing reads
[22:11:05] thanks