[01:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 11h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[01:43:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:13:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:28:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 15h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[06:33:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:51] not sure why we have failures related to pki1001 that is insetup right now, and also related to the discovery intermediate that was removed
[07:20:53] mmmmm
[07:20:54] going to check
[07:29:08] ok I think it should be sufficient to restart prometheus-node-exporter.service on all pki nodes
[07:34:14] * elukey bbiab
[08:45:06] no ok more difficult - stale file under /var/lib/prometheus/node.d
[08:45:08] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 18h 47m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[08:50:08] RESOLVED: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 18h 48m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[08:58:07] hopefully fixed now
[09:00:45] moritzm: ok if I depool pki2002 and reimage it to Trixie? I don't recall if it is already moved to nftables, maybe we could merge the change before the reimage as well
[09:01:47] I have a patch ready, I can merge it once the reimage has started, then it'll automatically pick up nftables for its initial puppet run
[09:01:59] and the plan sounds good!
[09:03:26] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:26] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:26] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:17:07] moritzm: going also to set UEFI
[09:18:26] FIRING: [7x] SystemdUnitFailed: cfssl-ocsprefresh-dse_front_proxy.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:09] moritzm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1286833 if you have a moment
[09:23:26] FIRING: [7x] SystemdUnitFailed: cfssl-ocsprefresh-cassandra.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:28] I am trying to upgrade the pki2002's idrac but it is taking ages
[09:47:54] ok something is moving
[10:21:17] moritzm: you can merge the nftables patch!
[10:21:49] bmc+bios firmwares upgraded, preseed changed to uefi + provisioning
[10:22:52] on it
[10:24:42] elukey: merged, pki2002 is configured for nftables now
[10:24:53] super, reimaging
[10:28:26] FIRING: [18x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:34:35] FIRING: DiskSpace: Disk space config-master1001:9100:/ 3.014% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=config-master1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:48:26] FIRING: [20x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:59:35] RESOLVED: DiskSpace: Disk space config-master1001:9100:/ 2.921% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=config-master1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:13:26] FIRING: [20x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:48] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:56:15] Mail, Infrastructure-Foundations, vrts: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11917199 (Xaosflux) Yahoo SMTP sender admin guide: https://senders.yahooinc.com/best-practices/
[13:29:33] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11917426 (ayounsi)
[13:30:11] netops, Infrastructure-Foundations, Traffic: 2026 Junos upgrade - https://phabricator.wikimedia.org/T416444#11917431 (ayounsi)
[13:30:13] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11917430 (ayounsi)
[13:35:32] netops, Infrastructure-Foundations: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 (ayounsi) NEW p:Triage→Medium
[13:36:01] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11917492 (ayounsi)
[14:21:43] netops, Infrastructure-Foundations: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11917759 (ayounsi)
[14:22:10] netops, Infrastructure-Foundations: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11917776 (ayounsi)
[14:22:48] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917781 (cmooney)
[14:23:00] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917782 (cmooney)
[14:23:47] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917784 (cmooney)
[14:24:42] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917793 (cmooney)
[14:30:13] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917850 (cmooney) In terms of the Nokia configuration for the ports connecting to the CRs set them up like this to create the two needed sub-interfaces...
[14:32:36] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11917886 (ayounsi)
[14:34:09] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11917903 (jcrespo) backup2015 is part of the media backup hosts, I can stop media backups for codfw before the maintenance on my own.
[14:40:39] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11917931 (jijiki)
[15:38:38] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11918235 (cmooney) Overall the other info in this task makes sense to me. I think we can do all the vlan renames in advance. So when we set up the swi...
[15:45:24] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11918255 (cmooney)
[16:13:41] FIRING: [19x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:03] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:37:34] FIRING: DiskSpace: Disk space build2001:9100:/ 1.435% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:58:31] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-04-24 - 2026-05-15): Create depool hiera keys for cirrussearch hosts - https://phabricator.wikimedia.org/T426228 (bking) NEW
[16:59:03] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11918618 (bking) @ayounsi Sorry for the trouble, confirming that the `depool` and `repool` commands are enough for `cirrussearch` hosts.
[18:50:19] Mail, Infrastructure-Foundations, vrts: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11919070 (Zache) There is likely related error report in finnish Wikipedia: When logging in, a login confirmation message appears, requiring confirmation of the login with a code sent by e...
[19:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:13:41] FIRING: [19x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:03] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:37:49] FIRING: DiskSpace: Disk space build2001:9100:/ 1.431% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:46:10] Mail, Infrastructure-Foundations, vrts: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11919758 (Johannnes89) >>! In T426105#11919070, @Zache wrote: > There is likely related error report in finnish Wikipedia: > > When logging in, a login confirmation message appears, requir...
[21:32:05] Mail, Infrastructure-Foundations, Product Safety and Integrity, Trust-and-Safety, vrts: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11920034 (kostajh)
[22:13:26] FIRING: [20x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed