[00:23:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:54] <jynus>	 ^fixed
[07:13:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:45] <jynus>	 ^ there seems to be some lag, the alert is not even on the dashboard anymore
[07:23:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:58] <marostegui>	 pc2012 is back, but replication has broken badly on pc1012; I am fixing it
[10:11:52] <jynus>	 upgrading db2184 to mariadb 10.11
[10:12:06] <jynus>	 (backup1-codfw replica)
[11:01:31] <federico3>	 Amir1: does this look good to you? https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/jobs/541314
[11:18:32] <jynus>	 es3 and es4 backups are about to finish, leaving for the weekend only es5
[11:26:10] <jynus>	 es2046 complaints seem to be due to the dump
[11:29:45] <jynus>	 I think 10.11 makes backups (dumps) more performant, which would be ok, except that means more resources consumed
[11:31:08] <jynus>	 It doesn't look bad ATM, plus those are old revisions; but it is something to keep an eye on when tuning new backups for es6 & es7
[11:32:14] <jynus>	 it is also because es2045 is down, so it is getting all production & dump load
[11:36:10] <jynus>	 I commented on the ticket
[12:17:18] <jynus>	 have a nice weekend
[12:44:15] <jinxer-wm>	 FIRING: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (430.8m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[13:00:34] <federico3>	 marostegui: do you have a recommendation around https://phabricator.wikimedia.org/T397453 ? (pooling back sooner vs monitoring the host)
[13:39:15] <jinxer-wm>	 RESOLVED: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (410m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[13:48:27] <federico3>	 Amir1: maybe you have any take on ^^^ ?
[14:05:56] <_joe_>	 I'm not a dba, but I think it makes sense to put it back in rotation 
[14:06:27] <_joe_>	 we can send an email to sre-at-large@ with instructions to depool it in case of another crash over the weekend
[14:49:44] <federico3>	 I asked infra foundation if they are tracking statistics on the reliability of firmware+kernel combos
[14:50:41] <federico3>	 ...they are not, so we can decide independently 
[15:08:14] <jinxer-wm>	 FIRING: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (422.1m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:13:15] <jinxer-wm>	 RESOLVED: MysqlHostIoPressure: MySQL instance es2046:9100 has too much pressure on its io capabilities: (400.1m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es2046%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:18:37] <federico3>	 as es2046 is unhappy with the io traffic I think we can just repool 2045 right now
[15:19:12] <federico3>	 I'll keep an eye out for pages during the rest of the day
[15:25:43] <_joe_>	 cdanis, swfrench-wmf ^^ in relation with the es failure of this morning
[15:26:16] <swfrench-wmf>	 ack, thanks!
[16:23:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:31] <federico3>	 a lot of slow replication
[20:23:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:58] <federico3>	 no, 6 db* are not replicating
[20:38:27] <federico3>	 ...and they all seem to be backup sources perhaps running a backup workload now...?
[22:10:07] <Amir1>	 federico3: backup sources stop replication intentionally to basically have a snapshot. That's intentional
[22:11:02] <federico3>	 Amir1: that's what I was thinking, seeing that all the dbs started doing reads
[22:11:05] <federico3>	 thanks