[03:11:10] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:16:10] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:48] btullis: o/ how did the upgrade and reimage go?
[07:21:10] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:10] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:30] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11939628 (MoritzMuehlenhoff)
[09:21:17] elukey: We're about to find out. I upgraded the idrac, but it was only two patch versions. I have re-run the provisioning cookbook, and I've tried another bookworm reimage. It's about to boot into the newly installed system.
[09:22:56] Nope. No change. Cookbook still unable to poll for uptime.
[09:23:33] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11939641 (ayounsi)
[09:24:01] root password still not being accepted over the serial console.
[09:33:15] btullis: did you see if the install step happened twice?
[09:34:07] It did not on this occasion. I'm pretty certain that it only happened once.
[09:35:03] lemme check reimage to understand what's happening then
[09:35:46] is it using UEFI or legacy? We may try to flip the boot method via provision, and test if anything changes
[09:38:25] yep confirmed it is using uefi
[09:44:19] btullis: ah wait a sec
[09:44:44] I bet I've done something stupid...
[09:45:02] nono I am reasoning out loud
[09:45:23] we have this bit injected by puppet in our base images
[09:45:24] d-i preseed/late_command string wget -O /tmp/late_command http://apt.wikimedia.org/autoinstall/scripts/late_command.sh && sh /tmp/late_command
[09:45:48] I noticed that you are using a preseed config with an extra recipe, that adds d-i preseed/late_command string
[09:46:18] it may override the former, or do something unexpected
[09:46:30] in late_command we set up the root's keys etc..
[09:46:54] Ah, yes. That makes sense. Thanks.
[09:47:24] maybe let's try to reimage with only
[09:47:26] - partman/standard-efi.cfg
[09:47:26] - partman/raid0-2dev-lvm-efi.cfg
[09:47:43] would it work? Just to understand if the late_command is the problem
[09:47:47] then we can work around it
[09:48:55] or we can comment out that bit on the install/apt server manually
[09:49:17] yes lemme do it, and then we can reimage again
[09:50:27] btullis: done, let's kick off another reimage
[09:50:28] Yes, please do go ahead.
[09:52:41] Oh, shall I kick it off?
[09:53:42] Started the reimage now.
[10:10:38] I have just discovered this module, which I can use instead of my partman and preseed/late-command combination: https://github.com/wikimedia/operations-puppet/tree/production/vendor_modules/lvm
[10:12:22] sorry didn't see the msgs till now
[10:14:08] "Got uptime for hosts dse-k8s-wdqs-test1001.eqiad.wmnet"
[10:14:10] ah nice!
[10:14:57] btullis: it looks like it worked right?
[10:14:58] It's all good, thanks. It booted correctly and it's on its first puppet run now, so you correctly diagnosed my self-inflicted error.
[10:16:14] Did you just manually comment out these lines? https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/partman/custom/kubernetes-node-containerd-raid-unallocated-efi.cfg#L21-L24
[10:26:21] exactly yes, on apt1002
[10:26:25] I am going to revert now
[11:09:11] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289921
[12:01:10] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:55] btullis: It looks good, for the future: https://gitlab.wikimedia.org/repos/sre/preseed-test
[12:21:10] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:10] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:22] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11940555 (JMeybohm)
[13:16:08] FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:10] FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:01] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941232 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b79d6725-839f-4b80-8718-5cb7000c8fbf) set by pt1979@cumin1003 for 2:00:00 on 3 host(s) and their services with reason: s...
[15:04:16] SRE-tools, Observability-Alerting, SRE Observability (FY2025/2026-Q1): Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#11941331 (LSobanski) Untagging #sre-tools, please loop us back in if needed.
[15:48:29] Hiii, I just pushed a change to the firewall of dbproxies https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289369 it should be a noop, but given the diff says netmon and idp have been moved, I'm giving a heads up: if these services stop being able to talk to the databases, this patch could be the reason
[15:49:02] just an abundance of caution
[16:10:39] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941641 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=57a4ad63-f533-4335-a960-7d2139446ca8) set by pt1979@cumin1003 for 2:00:00 on 3 host(s) and their services with reason: s...
[16:11:50] Mail, collaboration-services, Infrastructure-Foundations, VPS-project-Phabricator: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11941650 (A_smart_kitten) >>! In T422559#11938383, @jhathaway wrote...
[16:36:46] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941779 (Papaul)
[17:15:01] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941945 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fb6f5879-0c86-4cca-be43-4b2cb4494d10) set by pt1979@cumin1003 for 2:00:00 on 4 host(s) and their services with reason: s...
[17:21:10] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:23:04] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941963 (Papaul)
[18:04:55] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11942089 (Papaul)
[18:05:28] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11942095 (Papaul) Open→Resolved All 3 routers are now up to date.
[19:05:17] Mail, collaboration-services, Infrastructure-Foundations, VPS-project-Phabricator: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11942353 (Dzahn) Thanks for the root cause @jhathaway! Well, I gu...
[19:35:12] I have a host (kafka-jumbo1016) seemingly stuck in PXE boot. I've been able to take a screenshot via the iDRAC web UI. https://wikimedia.slack.com/archives/C055QGPTC69/p1779303718528979?thread_ts=1779274323.128399&cid=C055QGPTC69. Does this ring a bell for anyone? Thank you!
[19:37:05] all other kafka reimages went fine, but this one gets stuck even at the 2nd retry
[20:23:42] brouberol: I see this error, https://ipxe.org/err/280860, but I can't recall if that is always present?
[20:24:01] how long did you wait on the hang? does it ever timeout?
[20:24:12] all 240 retries
[20:24:37] see s.ukhe's response on #-sre. He suggested I reprovision the host with the --legacy flag, as our parted recipes are using BIOS
[20:24:55] that didn't go well either sadly, as the cookbook wasn't able to shut down the host
[20:24:59] hmm
[20:25:19] I'm re-attempting to reimage it to see whether it still fails the same way atm
[20:25:29] sounds good
[21:21:10] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:45:20] Mail, collaboration-services, Infrastructure-Foundations, VPS-project-Phabricator: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11942880 (A_smart_kitten) >>! In T422559#11942353, @Dzahn wrote: >...
[23:02:55] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11943144 (Papaul) I sent a follow up email on this and Engineer said he will get back with me