[03:11:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:16:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:48] <elukey>	 btullis: o/ how did the upgrade and reimage go?
[07:21:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:10] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:30] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11939628 (10MoritzMuehlenhoff)
[09:21:17] <btullis>	 elukey: We're about to find out. I upgraded the idrac, but it was only two patch versions. I have re-run the provisioning cookbook, and I've tried another bookworm reimage. It's about to boot into the newly installed system.
[09:22:56] <btullis>	 Nope. No change. Cookbook still unable to poll for uptime.
[09:23:33] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11939641 (10ayounsi)
[09:24:01] <btullis>	 root password still not being accepted over the serial console.
[09:33:15] <elukey>	 btullis: did you see if the install step happened twice? 
[09:34:07] <btullis>	 It did not on this occasion. I'm pretty certain that it only happened once.
[09:35:03] <elukey>	 lemme check reimage to understand what's happening then
[09:35:46] <elukey>	 is it using UEFI or legacy? We may try to flip the boot method via provision, and test if anything changes
[09:38:25] <elukey>	 yep confirmed it is using uefi
[09:44:19] <elukey>	 btullis: ah wait a sec
[09:44:44] <btullis>	 I bet I've done something stupid...
[09:45:02] <elukey>	 nono I am reasoning out loud
[09:45:23] <elukey>	 we have this bit injected by puppet in our base images
[09:45:24] <elukey>	 d-i	preseed/late_command	string	wget -O /tmp/late_command http://apt.wikimedia.org/autoinstall/scripts/late_command.sh && sh /tmp/late_command
[09:45:48] <elukey>	 I noticed that you are using a preseed config with an extra recipe, that adds d-i preseed/late_command string
[09:46:18] <elukey>	 it may override the former, or do something unexpected
[09:46:30] <elukey>	 in late_command we setup the root's keys etc..
[09:46:54] <btullis>	 Ah, yes. That makes sense. Thanks.
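The suspicion above (two preseed fragments both setting `d-i preseed/late_command`) can be checked mechanically. Below is a minimal sketch, not the tooling used in this channel: the function name `find_duplicate_preseed_keys`, the fragment labels, and the second sample line are hypothetical, and it assumes the usual whitespace-separated preseed layout `owner question type value` without handling backslash line continuations.

```python
from collections import defaultdict


def find_duplicate_preseed_keys(fragments):
    """Return {(owner, question): [(fragment_name, value), ...]} for every
    preseed key that is set more than once across the given fragments.

    `fragments` is an ordered list of (name, text) pairs.
    """
    seen = defaultdict(list)
    for name, text in fragments:
        for line in text.splitlines():
            line = line.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            # owner, question, type, then the rest of the line as the value.
            parts = line.split(None, 3)
            if len(parts) < 3:
                continue
            owner, question = parts[0], parts[1]
            value = parts[3] if len(parts) > 3 else ""
            seen[(owner, question)].append((name, value))
    return {key: entries for key, entries in seen.items() if len(entries) > 1}


# The first line is the one quoted in the log; the second is a made-up
# stand-in for the extra recipe's late_command.
base = (
    "d-i\tpreseed/late_command\tstring\twget -O /tmp/late_command "
    "http://apt.wikimedia.org/autoinstall/scripts/late_command.sh "
    "&& sh /tmp/late_command\n"
)
extra = "d-i preseed/late_command string echo hypothetical-extra-step\n"

dups = find_duplicate_preseed_keys([("base", base), ("extra-recipe", extra)])
for (owner, question), entries in dups.items():
    print(owner, question, [name for name, _ in entries])
```

Any key reported here is set by more than one fragment, so only one of the values is likely to take effect at install time; which one wins depends on how the installer loads the files and is not established in this log.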
[09:47:24] <elukey>	 maybe let's try to reimage with only
[09:47:26] <elukey>	     - partman/standard-efi.cfg
[09:47:26] <elukey>	     - partman/raid0-2dev-lvm-efi.cfg
[09:47:43] <elukey>	 would it work? Just to understand if the late_command is the problem
[09:47:47] <elukey>	 then we can workaround it
[09:48:55] <elukey>	 or we can comment that bit on the install/apt server manually
[09:49:17] <elukey>	 yes lemme do it, and then we can reimage again
[09:50:27] <elukey>	 btullis: done, let's kick off another reimage
[09:50:28] <btullis>	 Yes, please do go ahead.
[09:52:41] <btullis>	 Oh, shall I kick it off?
[09:53:42] <btullis>	 Started the reimage now.
[10:10:38] <btullis>	 I have just discovered this module, which I can use instead of my partman and preseed/late-command combination: https://github.com/wikimedia/operations-puppet/tree/production/vendor_modules/lvm
[10:12:22] <elukey>	 sorry didn't see the msgs till now
[10:14:08] <elukey>	 "Got uptime for hosts dse-k8s-wdqs-test1001.eqiad.wmnet"
[10:14:10] <elukey>	 ah nice!
[10:14:57] <elukey>	 btullis: it looks like it worked right?
[10:14:58] <btullis>	 It's all good, thanks. It booted correctly and it's on its first puppet run now, so you correctly diagnosed my self-inflicted error.
[10:16:14] <btullis>	 Did you just manually comment out these lines? https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/partman/custom/kubernetes-node-containerd-raid-unallocated-efi.cfg#L21-L24
[10:26:21] <elukey>	 exactly yes, on apt1002
[10:26:25] <elukey>	 I am going to revert now
[11:09:11] <btullis>	 elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289921
[12:01:10] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:55] <elukey>	 btullis: It looks good, for the future: https://gitlab.wikimedia.org/repos/sre/preseed-test
[12:21:10] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:10] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:22] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11940555 (10JMeybohm)
[13:16:08] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:10] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:01] <wikibugs>	 10netops, 06Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941232 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b79d6725-839f-4b80-8718-5cb7000c8fbf) set by pt1979@cumin1003 for 2:00:00 on 3 host(s) and their services with reason: s...
[15:04:16] <wikibugs>	 10SRE-tools, 10Observability-Alerting, 06SRE Observability (FY2025/2026-Q1): Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#11941331 (10LSobanski) Untagging #sre-tools, please loop us back in if needed.
[15:48:29] <Amir1>	 Hiii, I just pushed a change to the firewall of the dbproxies https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289369 it should be a noop, but given that the diff says netmon and idp have been moved, I'm giving a heads-up: if these services stop being able to talk to the databases, this patch could be the reason
[15:49:02] <Amir1>	 just abundance of caution  
[16:10:39] <wikibugs>	 10netops, 06Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941641 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=57a4ad63-f533-4335-a960-7d2139446ca8) set by pt1979@cumin1003 for 2:00:00 on 3 host(s) and their services with reason: s...
[16:11:50] <wikibugs>	 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10VPS-project-Phabricator: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11941650 (10A_smart_kitten) >>! In T422559#11938383, @jhathaway wrote...
[16:36:46] <wikibugs>	 10netops, 06Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941779 (10Papaul)
[17:15:01] <wikibugs>	 10netops, 06Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941945 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fb6f5879-0c86-4cca-be43-4b2cb4494d10) set by pt1979@cumin1003 for 2:00:00 on 4 host(s) and their services with reason: s...
[17:21:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:23:04] <wikibugs>	 10netops, 06Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941963 (10Papaul)
[18:04:55] <wikibugs>	 10netops, 06Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11942089 (10Papaul)
[18:05:28] <wikibugs>	 10netops, 06Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11942095 (10Papaul) 05Open→03Resolved All 3 routers are now up to date.
[19:05:17] <wikibugs>	 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10VPS-project-Phabricator: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11942353 (10Dzahn) Thanks for the root cause @jhathaway!   Well, I gu...
[19:35:12] <brouberol>	 I have a host (kafka-jumbo1016) seemingly stuck in PXE boot. I've been able to take a screenshot via the IDRAC web ui. https://wikimedia.slack.com/archives/C055QGPTC69/p1779303718528979?thread_ts=1779274323.128399&cid=C055QGPTC69. Does this ring a bell for anyone? Thank you!
[19:37:05] <brouberol>	 all other kafka reimages went fine, but this one gets stuck even at the 2nd retry 
[20:23:42] <jhathaway>	 brouberol: I see this error, https://ipxe.org/err/280860, but I can't recall if that is always present?
[20:24:01] <jhathaway>	 how long did you wait on the hang? does it ever timeout?
[20:24:12] <brouberol>	 all 240 retries
[20:24:37] <brouberol>	 see s.ukhe's response on #-sre. He suggested I reprovision the host with the --legacy flag, as our parted recipes are using BIOS
[20:24:55] <brouberol>	 that didn't go well either sadly, as the cookbook wasn't able to shut down the host
[20:24:59] <jhathaway>	 hmm
[20:25:19] <brouberol>	 I'm re-attempting to reimage it to see whether it still fails the same way atm
[20:25:29] <jhathaway>	 sounds good
[21:21:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:45:20] <wikibugs>	 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10VPS-project-Phabricator: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11942880 (10A_smart_kitten) >>! In T422559#11942353, @Dzahn wrote: >...
[23:02:55] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11943144 (10Papaul) I sent a follow up email on this and Engineer said he will get back with me