[00:44:25] RESOLVED: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:46:26] FIRING: [18x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[07:46:29] ^^ looking
[07:58:20] RESOLVED: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:06] ^^ cloudidp2001-dev had two entries in the role_owner metrics for a while
[08:52:57] headsup, I've started a rebalancing of kafka-logging. I'll intercept any alert in alertmanager and will silence them for ~1 day
[11:22:33] {{done}}
[11:33:21] thanks!
[12:34:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:44:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:48:34] FIRING: DiskSpace: Disk space titan2002:9100:/srv 3.776% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=titan2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:53:34] FIRING: [2x] DiskSpace: Disk space titan1002:9100:/srv 3.776% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:21:56] ^^ Remove unused md2 and add its devices to vg0 on titan1002 (T410152)
[14:21:56] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[14:42:18] ^^ Remove unused md2 and add its devices to vg0 on titan2002 (T410152)
[14:42:19] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[14:43:34] FIRING: [2x] DiskSpace: Disk space titan1002:9100:/srv 3.563% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:48:34] RESOLVED: [2x] DiskSpace: Disk space titan1002:9100:/srv 3.563% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure