[03:14:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:14:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:56:38] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11975082 (ops-monitoring-bot) Draining ganeti2045.codfw.wmnet of running VMs
[07:13:02] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11975129 (ops-monitoring-bot) Completed depooling of db2241 by fceratto@cumin1003: Depool for rack maintenance
[07:15:12] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11975135 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=55ec911b-df34-41d6-a145-ff36a55ba765) set by fceratto@cumin1...
[07:23:54] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11975160 (ops-monitoring-bot) Starting pool of db2241 by fceratto@cumin1003: Depool for rack maintenance
[07:25:27] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11975162 (ayounsi)
[07:26:12] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11975165 (ops-monitoring-bot) Depooled pc1021.eqiad.wmnet and pc2021.codfw.wmnet rack A3 maintenance - fceratto@cumin1003 - T427301
[07:26:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11975168 (ayounsi)
[07:26:49] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11975169 (ayounsi)
[07:26:56] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11975170 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=45ec6ca5-e7dc-4aac-829a-479be0c8c095) set by fceratto@cumin1003 for 2 days, 0:00:00 on 1 host(...
[07:29:00] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11975171 (ops-monitoring-bot) Completed depooling of db2158 by fceratto@cumin1003: rack A3 maintenance
[07:29:43] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11975176 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dd2e4787-ea9c-4ac3-a947-eed9b2dfef8b) set by fceratto@cumin1003 for 2 days, 0:00:00 on 1 host(...
[08:09:19] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11975294 (ops-monitoring-bot) Completed pooling of db2241 by fceratto@cumin1003: Depool for rack maintenance
[08:39:46] Puppet, collaboration-services, Gerrit, Infrastructure-Foundations, Patch-For-Review: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11975411 (ABran-WMF) >>! In T420184#11968357, @Dzahn wrote: > The stri...
[08:59:22] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11975451 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=79ff0806-933d-49e7-8c67-71ba7d45bc8b) set by fceratto@cumin1003 for 2 days, 0:00:00 on 1 host(...
[08:59:44] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11975452 (FCeratto-WMF)
[08:59:54] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11975453 (FCeratto-WMF)
[09:01:22] netops, Infrastructure-Foundations, SRE: cr2-drmrs unexpected reboot - https://phabricator.wikimedia.org/T427600#11975473 (cmooney) {F86180058}
[09:28:28] FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:30:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-export_service_type.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:10:03] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11976107 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ff06b6f-10ee-45fc-a0d8-f9e445c2a726) set by ayounsi@cumin1003 for 1:00:00 on 27 host(s) and t...
[12:11:16] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11976109 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2cc85cc5-a5e3-45c9-b39d-9775610ef6e4) set by ayounsi@cumin1003 for 1:00:00 on 3 host(s) and th...
[14:02:26] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11976595 (ayounsi) Open→Resolved a:ayounsi A3 switch upgrade went fine, ~11min switch downtime plus a few more for the interfaces to come back up. We ha...
[14:03:37] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11976599 (ayounsi)
[14:54:26] netops, Infrastructure-Foundations: Nokia: implement maintenance mode - https://phabricator.wikimedia.org/T419673#11976869 (ayounsi)
[15:30:01] netops, SRE-tools, Infrastructure-Foundations, SRE: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762#11977272 (ayounsi) Open→Resolved a:ayounsi No need to keep that old parent task open.
[15:33:26] netops, Infrastructure-Foundations, SRE: Occasional high ICMP probe response from codfw to cr1-drmrs - https://phabricator.wikimedia.org/T315645#11977334 (cmooney) Stalled→Declined
[15:53:36] netbox, netops, DC-Ops, Infrastructure-Foundations, SRE: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007#11977548 (ayounsi) Open→Resolved a:ayounsi Looks like the current provisioning process with the `Port with no description on access switch` ale...
[16:00:25] FIRING: SystemdUnitFailed: prometheus-puppet-ca-exporter.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:25] RESOLVED: SystemdUnitFailed: prometheus-puppet-ca-exporter.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:43] Puppet, collaboration-services, Gerrit, Infrastructure-Foundations, Patch-For-Review: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11977655 (Dzahn) Ah, it's in the puppetserver module. Thanks! The pat...
[16:08:15] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364#11977663 (ayounsi) Open→Resolved I've renamed it to `Interface UP for 7 days with no description` when migrating the alert to AlertManager. Please...
[20:29:08] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11979041 (jijiki)