[02:31:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:31:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:31:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:02:09] PROBLEM - MariaDB sustained replica lag on s4 on db2147 is CRITICAL: 25.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2147&var-port=9104
[11:18:09] RECOVERY - MariaDB sustained replica lag on s4 on db2147 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2147&var-port=9104
[11:28:09] PROBLEM - MariaDB sustained replica lag on s4 on db2147 is CRITICAL: 14.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2147&var-port=9104
[11:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:45:09] PROBLEM - MariaDB sustained replica lag on s4 on db2147 is CRITICAL: 12.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2147&var-port=9104
[11:59:48] FIRING: MysqlReplicationLag: MySQL instance db2147:9104@s4 has too large replication lag (3m 10s). Its replication source is db2179.codfw.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2147&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[11:59:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2147:9104 has too large replication lag (3m 10s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2147&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[12:13:16] I think this was related to T368098 (cc Amir1)
[12:13:17] T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[12:48:11] RECOVERY - MariaDB sustained replica lag on s4 on db2147 is OK: (C)10 ge (W)5 ge 1.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2147&var-port=9104
[14:31:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:31:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:31:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on ms-be1065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:42:34] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure