[00:02:11] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338235 (Papaul)
[00:04:38] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338243 (Papaul) @cmooney thanks for the feedback we can clarify this tomorrow during the meeting and have all ready and run it by @ayounsi when he is back.
[00:07:51] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338254 (Papaul)
[00:10:21] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11338258 (Papaul)
[00:53:25] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[01:11:43] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:44:40] Mail, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137 (DamianZaremba) NEW
[01:58:42] Mail, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11338499 (DamianZaremba)
[01:59:25] Mail, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11338500 (DamianZaremba) Tagging SRE as not sure which team is responsible.
[04:53:25] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[05:11:43] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:02:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:24] ^ the build alert is caused by the terrible JDK test suite, I'm building Bullseye/Bookworm forward ports of the latest OpenJDK8 security release
[08:42:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:25] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[08:57:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:11:48] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:53:25] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[12:57:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:22:32] ^ I think the staleness alert is related to the decom of the last remaining mw baremetal servers (mwdebug), I've manually moved mediawiki-conftool-state.prom out of /var/lib/prometheus/node.d, let's see if that fixes it
[13:27:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:37:28] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178 (cmooney) NEW p:Triage→Medium
[13:39:40] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11340040 (cmooney)
[14:02:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:55] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:30] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#11340457 (elukey) Open→Resolved a:elukey I think we can close this task, we have establ...
[15:27:45] SRE-tools, Infrastructure-Foundations, Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581#11340469 (elukey) Open→Resolved a:elukey The fix will go out in the next Spicerack release, closing!
[16:53:25] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[17:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:09:10] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:16] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11341689 (VRiley-WMF) Spoke to @cmooney about this ticket. This no longer has to be mo...
[18:51:24] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11341696 (VRiley-WMF) Spoke to @cmooney about this ticket. This no longer has to be mo...
[18:58:04] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11341710 (cmooney) >>! In T405628#11341689, @VRiley-WMF wrote: > Spoke to @cmooney abo...
[20:53:25] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[21:05:08] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11342084 (cmooney) >>! In T405609#11341696, @VRiley-WMF wrote: > Spoke to @cmooney abo...
[21:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:11:44] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:13:16] netops, Infrastructure-Foundations, SRE: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217 (cmooney) NEW p:Triage→Medium
[21:20:50] netops, Infrastructure-Foundations, SRE: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217#11342153 (Dzahn) I would expect the cause is that someone committed as root: ` root@netmon1003:/var/lib/rancid/core/.git/objects# find . -user root ./f2 ./f2/...
[22:09:10] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:17:03] netops, Infrastructure-Foundations, SRE: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217#11342444 (cmooney) Thanks @Dzahn appreciate it! Yep that's what I thought, I will give your suggestion a try in the morning and see does it resolve the problem.