[00:35:35] FIRING: [2x] DiskSpace: Disk space mwlog1002:9100:/srv 3.973% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:45:35] FIRING: [2x] DiskSpace: Disk space mwlog1002:9100:/srv 3.937% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:45:40] FIRING: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:55:25] FIRING: [2x] SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:35] FIRING: [2x] DiskSpace: Disk space mwlog1002:9100:/srv 3.712% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:35:35] FIRING: [2x] DiskSpace: Disk space mwlog1002:9100:/srv 3.96% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:50:35] FIRING: [2x] DiskSpace: Disk space mwlog1002:9100:/srv 3.881% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:50:35] RESOLVED: DiskSpace: Disk space mwlog2002:9100:/srv 3.231% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:55:40] FIRING: [2x] SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:40] FIRING: [2x] SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:19] I'm soon (probably next week) going to have to do some maintenance on the eqiad D7 ToR switch. It has two of your servers: 'logging-hd1003', 'prometheus1007'. Do they need to be depooled, and if so, is there a cookbook for it?
[11:19:15] I'm also looking at it with https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1239896 in mind
[12:55:40] FIRING: [2x] SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:19:05] hey XioNoX, the HA model for Prometheus doesn't require any depooling; as for the logging host, I'll let you talk to cwhite
[16:55:40] FIRING: [2x] SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:02:40] XioNoX: OpenSearch doesn't either, technically, but there's a switch we can flip to mitigate the churn caused when the cluster detects a down node. There's no cookbook for it, unfortunately. Let me know when you want to do it so I can prepare the cluster.
[19:24:34] FIRING: DiskSpace: Disk space mwlog2002:9100:/srv 3.967% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:55:40] FIRING: [2x] SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:29:34] FIRING: [2x] DiskSpace: Disk space mwlog1002:9100:/srv 3.96% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace