[00:11:14] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:11:50] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:14:56] vriley@cumin1003 reimage (PID 3337403) is awaiting input
[00:16:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:50:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:50:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1273.eqiad.wmnet with OS bookworm
[00:50:19] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915460 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1273.eqiad.wmnet with OS bookworm completed: - db1273 (**PASS**) -...
[00:52:35] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[00:56:25] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1274] - vriley@cumin1003"
[00:56:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1274] - vriley@cumin1003"
[00:56:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:57:38] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1274
[00:58:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1274
[00:59:39] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1274.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:04:00] vriley@cumin1003 provision (PID 3352153) is awaiting input
[01:08:16] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[01:09:58] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1286531
[01:09:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1286531 (owner: TrainBranchBot)
[01:12:01] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1275] - vriley@cumin1003"
[01:12:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1275] - vriley@cumin1003"
[01:12:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:12:30] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1275
[01:12:59] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1274.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:14:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1275
[01:18:26] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1275.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:20:18] jouncebot: nowandnext
[01:20:18] No deployments scheduled for the next 4 hour(s) and 39 minute(s)
[01:20:18] In 4 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0600)
[01:21:33] (PS1) Zabe: Start reading from new tables everywhere except commons (2nd try) [mediawiki-config] - https://gerrit.wikimedia.org/r/1286532 (https://phabricator.wikimedia.org/T416548)
[01:22:47] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1286531 (owner: TrainBranchBot)
[01:22:54] (CR) Zabe: [C:+2] Start reading from new tables everywhere except commons (2nd try) [mediawiki-config] - https://gerrit.wikimedia.org/r/1286532 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[01:23:49] (Merged) jenkins-bot: Start reading from new tables everywhere except commons (2nd try) [mediawiki-config] - https://gerrit.wikimedia.org/r/1286532 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[01:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 11h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[01:25:59] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1286532|Start reading from new tables everywhere except commons (2nd try) (T416548)]]
[01:26:02] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[01:27:14] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1274.eqiad.wmnet with OS bookworm
[01:27:21] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915489 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1274.eqiad.wmnet with OS bookworm
[01:27:56] !log zabe@deploy1003 zabe: Backport for [[gerrit:1286532|Start reading from new tables everywhere except commons (2nd try) (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:28:25] !log zabe@deploy1003 zabe: Continuing with deployment
[01:32:34] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286532|Start reading from new tables everywhere except commons (2nd try) (T416548)]] (duration: 06m 35s)
[01:32:38] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[01:37:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1275.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:41:19] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915532 (VRiley-WMF) Open→Resolved
[01:41:41] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915535 (VRiley-WMF) Resolved→Open
[01:42:29] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915538 (VRiley-WMF)
[01:43:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:44:28] (CR) Dragoniez: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: Nvdtn19)
[01:58:03] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1275.eqiad.wmnet with OS bookworm
[01:58:13] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915540 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1275.eqiad.wmnet with OS bookworm
[02:00:47] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:02] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[02:07:31] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 44s)
[02:09:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:48] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1276] - vriley@cumin1003"
[02:10:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1276] - vriley@cumin1003"
[02:10:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:11:43] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1276
[02:13:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:13:45] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1275.eqiad.wmnet with reason: host reimage
[02:15:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1276
[02:15:04] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:15:22] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid