[00:03:21] FIRING: [8x] GanetiCACertificateAboutToExpire: Ganeti CA certificate is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[02:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[06:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[09:53:24] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11348677 (fgiunchedi) Thank you @cmooney for the summary, I'll add a few thoughts I had while working on the Toolforge on Metal p...
[10:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:04:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[14:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[16:13:26] RESOLVED: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:45] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11350945 (cmooney) So this is causing a lot of logspam on our Nokia switches right now. What I've noticed before is that our hosts tend to alternate between two LLDP neighbors co...
[19:30:06] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11351215 (cmooney) FWIW this will need further investigation, I've reset a bunch of these switches which will cause the scenario the alerts should fire, but I...
[19:57:49] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11351345 (cmooney) Small update, right now lsw1-d6-eqiad is broken. So this alert should be present for ssw1-d1-eqiad and ssw1-d8-eqiad.
[20:04:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[21:13:44] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11351612 (colewhite) In today's case, the alert criteria wasn't met because the metrics [[ https://grafana-rw.wikimedia.org/explore?schemaVersion=1&panes=%7B%...
[22:17:34] FIRING: DiskSpace: Disk space install1005:9100:/ 6.441% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:52:34] FIRING: DiskSpace: Disk space install1005:9100:/ 3.445% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:49:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on install1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed