[03:09:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:45] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10524070 (ayounsi) Sure, as usual for power/console/mgmt. Regarding production ports : On the ssw1 side: `use `et-0/0/7` towards e8 and `et-0/0/15` tow...
[08:25:28] the above errors are due to Error: unknown command "backup" for "etcdctl"
[08:59:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:45] netops, Infrastructure-Foundations, SRE: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10524854 (cmooney) Open→Resolved
[11:50:31] netops, Infrastructure-Foundations, SRE: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10524944 (cmooney) Open→Resolved >>! In T381175#10520327, @ayounsi wrote: > For (1) we can have the `sre.ganeti.addnode` cookbook call...
[12:54:00] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10525215 (cmooney) >>! In T384731#10516013, @ayounsi wrote: > An alternative (or short term solution...
[12:56:10] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10525220 (cmooney) >>! In T384052#10516521, @ayounsi wrote: > I'm wondering if we could re-write the "instance" in Prometheus t...
[12:59:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:11] netops, Infrastructure-Foundations, ops-magru: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10525590 (cmooney) I've added BFD to this particular session now. Not that it will fix things but it should give us more datapoints for the (likely) case with Ju...
[14:54:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:41:02] netops, Infrastructure-Foundations, ops-magru: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10525894 (ayounsi) Good idea regarding BFD. From https://supportportal.juniper.net/s/article/Observing-BGP-IO-ERROR-CLOSE-SESSION-error-logs-when-BGP-protocolgoes...
[16:19:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed