[00:08:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166290
[00:08:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166290 (owner: TrainBranchBot)
[00:11:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:29:48] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166290 (owner: TrainBranchBot)
[00:46:27] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:46:41] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/8be104b639d7543f5dfd79fda06b957c109c0787a3279d744db013f4c23f0438/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[00:56:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:06:41] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:29:51] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:41] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:34:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:12:43] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1032 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[03:13:43] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti1032 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[03:15:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[03:19:20] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [cloudcephosd1042] - vriley@cumin1002"
[03:19:25] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [cloudcephosd1042] - vriley@cumin1002"
[03:19:25] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:20:51] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1042
[03:21:28] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1042
[03:22:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[03:44:30] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:44:54] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:46:33] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:50:06] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:51:33] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:52:58] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:15:32] <logmsgbot>	 vriley@cumin1002 provision (PID 922338) is awaiting input
[04:25:07] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:32:16] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[04:32:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[04:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:03] <wikibugs>	 (CR) Ayounsi: [C:+1] C:bird::anycast_healthchecker: notify service on conf file change [puppet] - https://gerrit.wikimedia.org/r/1166238 (owner: Ssingh)
[05:21:18] <logmsgbot>	 vriley@cumin1002 reimage (PID 928848) is awaiting input
[05:22:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10973902 (ayounsi) There is currently only one switch per rack, so I suggest we only use one uplink for now, and revisit it the day we have more.
[05:25:31] <wikibugs>	 (CR) Ayounsi: Redfish: more tests (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1164473 (owner: Ayounsi)
[05:25:43] <wikibugs>	 (PS4) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[05:25:52] <wikibugs>	 (CR) CI reject: [V:-1] Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473 (owner: Ayounsi)
[05:26:05] <wikibugs>	 (PS5) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[05:32:18] <wikibugs>	 (PS6) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[05:32:30] <wikibugs>	 (CR) Ayounsi: Redfish: more tests (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1164473 (owner: Ayounsi)
[05:34:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:36:27] <wikibugs>	 (CR) Ayounsi: reimage: temporarily store the MAC in Netbox (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[05:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T0600)
[06:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:12:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3007.wikimedia.org
[06:16:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3007.wikimedia.org
[06:20:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973917 (VRiley-WMF)
[06:20:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc1001.eqiad.wmnet
[06:21:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973918 (VRiley-WMF) Attempted to image cloudcephosd1042, however it seems to get stuck. Investigating this issue.
[06:24:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to plain
[06:25:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to plain
[06:25:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet
[06:26:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to plain
[06:28:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to plain
[06:29:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6001.wikimedia.org to plain
[06:30:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6001.wikimedia.org to plain
[06:30:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6001.drmrs.wmnet to plain
[06:31:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6001.drmrs.wmnet to plain
[06:32:07] <moritzm>	 !log failover Ganeti master in drmrs01 to ganeti6003 T382513
[06:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:10] <stashbot>	 T382513: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513
[06:32:15] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:33:35] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:33:49] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:34:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc1002.eqiad.wmnet
[06:34:35] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:34:49] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:34:49] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[06:35:13] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:39:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1002.eqiad.wmnet
[06:44:47] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10973945 (MoritzMuehlenhoff)
[06:57:36] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] C:prometheus: dnsbox_service_state_exporter s/define/class [puppet] - https://gerrit.wikimedia.org/r/1166224 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[06:57:46] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] hiera: dnsbox: set supplementary_groups for anycast-hc [puppet] - https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[06:58:06] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] bird/anycast-hc: allow setting SupplementaryGroups for anycast-hc unit [puppet] - https://gerrit.wikimedia.org/r/1166222 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T0700)
[07:04:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10973970 (ayounsi) →Duplicate dup:T398315
[07:04:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://phabricator.wikimedia.org/T398315#10973972 (ayounsi)
[07:04:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10973975 (ayounsi) Closing that task as duplicate of the automatically opened one. If I do the other way around,...
[07:06:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://phabricator.wikimedia.org/T398315#10973978 (ayounsi) More information on {T398442}, probably just need the cable swapped. @Jclark-ctr @VRiley-WMF
[07:10:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2003.codfw.wmnet
[07:16:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2003.codfw.wmnet
[07:19:44] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6001.drmrs.wmnet with reason: reimage
[07:21:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS bookworm
[07:22:05] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10973983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6001.drmrs.wmnet with OS bookworm
[07:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[07:29:59] <wikibugs>	 (CR) Jelto: [C:+1] "looks mostly good, some comments in-line. I'm still a bit surprised by the complexity of this gerrit failover cookbooks. I'd hope to reduc" [cookbooks] - https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:32:06] <wikibugs>	 SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671 (MoritzMuehlenhoff) NEW
[07:32:27] <wikibugs>	 (CR) Jelto: "lgtm, thanks for the cleanup" [puppet] - https://gerrit.wikimedia.org/r/1165874 (owner: Muehlenhoff)
[07:32:31] <wikibugs>	 (CR) Jelto: [C:+1] Remove now unused Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1165874 (owner: Muehlenhoff)
[07:35:25] <wikibugs>	 SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974006 (MoritzMuehlenhoff)
[07:36:15] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2019.codfw.wmnet
[07:39:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6001.drmrs.wmnet with reason: host reimage
[07:40:20] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1166262 (https://phabricator.wikimedia.org/T397591) (owner: BryanDavis)
[07:40:39] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) (owner: BryanDavis)
[07:41:19] <wikibugs>	 SRE, collaboration-services, Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10974020 (ABran-WMF) Open→In progress p:Triage→High neat, thanks! I've sent you a draft document to review, I'll put it on wikitech once...
[07:41:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove now unused Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1165874 (owner: Muehlenhoff)
[07:42:42] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[07:42:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:42:56] <vgutierrez>	 !log depooling cp7006 for testing purposes
[07:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6001.drmrs.wmnet with reason: host reimage
[07:48:16] <logmsgbot>	 jmm@cumin1003 decommission (PID 224703) is awaiting input
[07:52:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:53:03] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2019.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:53:21] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2019.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:53:21] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:53:22] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2019.codfw.wmnet
[07:53:29] <wikibugs>	 SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974081 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `ganeti2019.codfw.wmnet` - ganeti2019.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertm...
[07:53:51] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2020.codfw.wmnet
[07:56:27] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: testing
[07:58:19] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:03:13] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:03:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6001.drmrs.wmnet with OS bookworm
[08:03:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10974096 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6001.drmrs.wmnet with OS bookworm completed: - ganeti6001 (**PASS*...
[08:04:12] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:04:12] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:04:13] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2020.codfw.wmnet
[08:04:22] <wikibugs>	 SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974097 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `ganeti2020.codfw.wmnet` - ganeti2020.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertm...
[08:06:20] <wikibugs>	 (CR) Arnaudb: "I hope to reduce complexity over time as well! It is also why this cookbook has been decoupled from the "main one". I agree with you that " [cookbooks] - https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:07:03] <wikibugs>	 (PS1) Muehlenhoff: Remove ganeti2019/ganeti2020 from site.pp [puppet] - https://gerrit.wikimedia.org/r/1166344 (https://phabricator.wikimedia.org/T398671)
[08:07:22] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10974101 (MoritzMuehlenhoff)
[08:08:07] <wikibugs>	 SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974105 (MoritzMuehlenhoff)
[08:08:12] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974107 (MoritzMuehlenhoff)
[08:08:32] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove ganeti2019/ganeti2020 from site.pp [puppet] - https://gerrit.wikimedia.org/r/1166344 (https://phabricator.wikimedia.org/T398671) (owner: Muehlenhoff)
[08:08:54] <wikibugs>	 (PS1) Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301)
[08:10:53] <wikibugs>	 (CR) Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[08:11:14] <wikibugs>	 (CR) CI reject: [V:-1] statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[08:14:57] <wikibugs>	 (CR) Jelto: [C:+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.0 [puppet] - https://gerrit.wikimedia.org/r/1166142 (https://phabricator.wikimedia.org/T394382) (owner: Jelto)
[08:16:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet
[08:21:51] <wikibugs>	 SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10974133 (MatthewVernon) I think this may have been a caching issue - if I copy your wikitext into the Sandbox, I get what look to me to be correct thumbnails; and indeed the permalink to the cs....
[08:22:15] <wikibugs>	 (PS5) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530)
[08:22:45] <wikibugs>	 (CR) Cathal Mooney: "Thanks for the review, updated patchset addresses the comments I think." [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:25:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[08:32:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6001.drmrs.wmnet to cluster drmrs01 and group B12
[08:33:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6001.drmrs.wmnet to cluster drmrs01 and group B12
[08:33:41] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10974158 (MoritzMuehlenhoff)
[08:33:51] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10974159 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done
[08:37:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet
[08:47:08] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2077 is OK: OK - load average: 8.58, 2.04, 0.68 https://wikitech.wikimedia.org/wiki/Swift
[08:47:15] <wikibugs>	 (CR) Cathal Mooney: [C:+1] C:bird::anycast_healthchecker: notify service on conf file change [puppet] - https://gerrit.wikimedia.org/r/1166238 (owner: Ssingh)
[08:48:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to drbd
[08:48:36] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet
[08:48:52] <wikibugs>	 (CR) Clément Goubert: [C:+1] zarcillo: Update egress to idp.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[08:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[08:52:48] <wikibugs>	 SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10974204 (matej_suchanek) Yes, everything looks correct now.  But is the problem really resolved? The article where I noticed the problem was created [[ https://cs.wikipedia.org/w/index.php?title...
[08:55:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10974229 (Clement_Goubert) Open→In progress p:Triage→Medium a:Clement_Goubert
[08:56:05] <wikibugs>	 SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10974233 (MatthewVernon) What I assume happened is that some of the thumbnails from the first upload didn't get overwritten when the second upload was made (I purged the image this morning, which...
[08:58:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to drbd
[08:58:22] <icinga-wm>	 PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:58:41] <icinga-wm>	 RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 87.49 ms
[08:59:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to drbd
[08:59:32] <wikibugs>	 (PS2) Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301)
[09:00:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:01:46] <wikibugs>	 (CR) CI reject: [V:-1] statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[09:06:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10974246 (cmooney) p:High→Medium This has been stable since the optics were replaced yesterday.  I will review again next week a...
[09:07:03] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backupmon1001.eqiad.wmnet with reason: Maintenance and reboot
[09:08:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:10:23] <wikibugs>	 (03PS1) 10Joal: Fix user and user_old views for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602)
[09:10:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10974254 (10MoritzMuehlenhoff)
[09:16:02] <wikibugs>	 (03CR) 10Btullis: Fix user and user_old views for WMCS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal)
[09:16:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to drbd
[09:16:33] <icinga-wm>	 PROBLEM - Host install6002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:47] <icinga-wm>	 RECOVERY - Host install6002 is UP: PING OK - Packet loss = 0%, RTA = 87.54 ms
[09:17:14] <wikibugs>	 (03PS1) 10Joal: Update analytics sqoop script tables [puppet] - 10https://gerrit.wikimedia.org/r/1166347 (https://phabricator.wikimedia.org/T398602)
[09:18:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:19:28] <wikibugs>	 (03PS2) 10Joal: Fix user and user_old views for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602)
[09:19:33] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update analytics sqoop script tables [puppet] - 10https://gerrit.wikimedia.org/r/1166347 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal)
[09:19:35] <wikibugs>	 (03CR) 10Btullis: [V:03+2 C:03+2] Update analytics sqoop script tables [puppet] - 10https://gerrit.wikimedia.org/r/1166347 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal)
[09:21:12] <wikibugs>	 (03PS3) 10Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301)
[09:21:48] <wikibugs>	 (03CR) 10Btullis: Fix user and user_old views for WMCS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal)
[09:21:50] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Fix user and user_old views for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal)
[09:22:23] <wikibugs>	 (03PS1) 10Brouberol: airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683)
[09:22:24] <wikibugs>	 (03PS1) 10Brouberol: airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683)
[09:22:25] <wikibugs>	 (03PS1) 10Brouberol: airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683)
[09:22:27] <wikibugs>	 (03PS1) 10Brouberol: airflow-ml: enable interactions with the production analytics cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683)
[09:23:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10974347 (10Clement_Goubert) Please make sure the [[ https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_account | Wikimedia global a...
[09:28:15] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686 (10Tobi_WMDE_SW) 03NEW
[09:33:17] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974387 (10Clement_Goubert)
[09:33:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974388 (10Clement_Goubert)
[09:34:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:37:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974404 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium @Tobi_WMDE_SW Can you or @sowmya.guru fill out the first part of the...
[09:37:11] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[09:42:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974422 (10taavi)
[09:46:14] <wikibugs>	 (03CR) 10Hashar: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[09:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: DNS resolution not working on Juniper virtual-chassis switches eqiad - https://phabricator.wikimedia.org/T398690 (10cmooney) 03NEW p:05Triage→03Medium
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:51:56] <wikibugs>	 (03CR) 10Hashar: [C:03+2] gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:01:35] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backupmon1001.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002
[10:01:40] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:01:52] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:02:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-ml: enable interactions with the production analytics cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:02:19] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[10:02:22] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:02:25] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:02:27] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:02:30] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-ml: enable interactions with the production analytics cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:03:20] <wikibugs>	 (03CR) 10Btullis: "I think that the values-analytics-production.yaml symlink also needs to be added." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:03:34] <wikibugs>	 (03PS1) 10Elukey: redfish: add support for iDRAC 10 to force_http_boot_once [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166371 (https://phabricator.wikimedia.org/T393044)
[10:04:24] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:04:38] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:04:49] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:04:51] <wikibugs>	 (03Merged) 10jenkins-bot: airflow-ml: enable interactions with the production analytics cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol)
[10:08:29] <wikibugs>	 (03PS1) 10Brouberol: airflow-ml: add values-analytics-production symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166372
[10:08:59] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-ml: add values-analytics-production symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166372 (owner: 10Brouberol)
[10:11:09] <wikibugs>	 (03CR) 10Jobo: [C:03+2] data.yaml: Allow tailing of spiderpig jobrunner and apiserver journals [puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy)
[10:12:01] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-ml: add values-analytics-production symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166372 (owner: 10Brouberol)
[10:13:06] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[10:13:40] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[10:14:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10974520 (10Marostegui) @VRiley-WMF is it a completely new server right? With a different asset tag? If that's the case, I think we should treat it as such and just give it a new hostname: db1259 and f...
[10:15:35] <wikibugs>	 (03PS10) 10Vgutierrez: cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[10:18:12] <logmsgbot>	 !log cgoubert@deploy1003 Locking from deployment [ALL REPOSITORIES]: Dragonfly supernodes reboot
[10:18:55] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet
[10:19:00] <wikibugs>	 (03PS1) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[10:19:05] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on es2044 - https://phabricator.wikimedia.org/T398601#10974527 (10Marostegui) p:05Triage→03Medium
[10:19:07] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10974529 (10elukey) I have a patch for spicerack that I need to test, I found a workaround but I'd like to make sure it works :)
[10:19:49] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10974532 (10elukey) I am going to test the above patch in T393044, and I'll report back (same issues, new Dells have an idrac version with a lot of not documented change...
[10:22:28] <wikibugs>	 (03CR) 10Vgutierrez: pyrra: remove multi-dc for istio-based SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[10:22:43] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet
[10:23:16] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet
[10:26:28] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "This needs some extra work, since we are using trafficserver_backend_requests_seconds_count that is available to all DCs. I think we shoul" [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[10:26:59] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet
[10:27:19] <logmsgbot>	 !log cgoubert@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Dragonfly supernodes reboot (duration: 09m 07s)
[10:31:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[10:31:30] <wikibugs>	 (03PS2) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[10:35:21] <wikibugs>	 (03PS3) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[10:38:36] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275)
[10:38:45] <wikibugs>	 (03PS1) 10Elukey: TEST - fix http_boot_once for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1166378
[10:41:22] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[10:43:00] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2203,2212].codfw.wmnet with reason: Maintenance
[10:43:51] <wikibugs>	 (03CR) 10Jcrespo: "Should I add devtmpfs ? so we only monitor / and /srv ?" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:45:03] <wikibugs>	 (03CR) 10Marostegui: "+1 to exclude it too. Thanks for working on this." [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:46:04] <wikibugs>	 (03CR) 10Jcrespo: "Feel free to answer/amend/merge it yourselves, leaving the decision to the dbas." [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:46:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[10:46:28] <wikibugs>	 (03CR) 10Jcrespo: "Ok, amending." [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:48:22] <wikibugs>	 (03PS8) 10Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762)
[10:48:24] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275)
[10:49:11] <wikibugs>	 (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:49:53] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:50:06] <wikibugs>	 (03PS4) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[10:51:04] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[10:51:43] <wikibugs>	 (03PS1) 10Marostegui: installserver: Remove db2244 [puppet] - 10https://gerrit.wikimedia.org/r/1166381
[10:52:28] <wikibugs>	 06SRE, 06DBA, 06serviceops, 05MW-1.44-notes, and 2 others: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10974717 (10Ladsgroup) 05Open→03Resolved I filed {T398693} and {T398692} as follow ups. Closing this. Feel free to add more follow...
[10:53:11] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275)
[10:53:22] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:53:40] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[10:53:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:54:14] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275)
[10:54:22] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:55:20] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "As far as I can tell, wmf.8 has successfully been deployed to all wikis. That means this change (and the one preceding it) should be read" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm)
[10:56:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Remove db2244 [puppet] - 10https://gerrit.wikimedia.org/r/1166381 (owner: 10Marostegui)
[10:56:11] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 32 hosts with reason: maintenance
[10:57:15] <wikibugs>	 (03CR) 10Muehlenhoff: New structure for sshd_config starting with trixie (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[10:57:35] <wikibugs>	 (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/output/1166377/4452/" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[10:57:43] <wikibugs>	 (03CR) 10Muehlenhoff: "All feedback has been integrated, should be ready for a fresh review" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[10:58:19] <wikibugs>	 (03CR) 10Vgutierrez: "In https://wikitech.wikimedia.org/wiki/SLO/WDQS#Service_Level_Indicators_(SLIs) it seems like they are focusing on availability (status co" [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[10:59:39] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Thank you for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[11:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T0700)
[11:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: How many deployers does it take to do GitLab version upgrades deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T1100).
[11:00:08] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] New function to generate device-specific IBGP data from cluster YAML [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[11:01:37] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "thanks for the answers, looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[11:01:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[11:03:01] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "I will NOT merge it myself. if it was me, I would wait until monday and test it gradually to prevent hundreds of pages. Letting you do it " [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo)
[11:04:21] <wikibugs>	 (03PS5) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[11:05:42] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v10.0.2 with ibgp function in plugin - cmooney@cumin1003
[11:08:07] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v10.0.2 with ibgp function in plugin - cmooney@cumin1003
[11:09:11] <wikibugs>	 (03CR) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[11:09:46] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10974830 (10Marostegui) This host is in the same rack as es2044 {T398601}  Something with the rack D3?
[11:10:09] <wikibugs>	 (03PS1) 10Brouberol: airflow-k8s: add monitoring on scheduler not heartbeating [alerts] - 10https://gerrit.wikimedia.org/r/1166383 (https://phabricator.wikimedia.org/T398420)
[11:12:19] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice." [alerts] - 10https://gerrit.wikimedia.org/r/1166383 (https://phabricator.wikimedia.org/T398420) (owner: 10Brouberol)
[11:12:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:15:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney)
[11:19:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Adapt gitreview config to new repo name [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1166385 (https://phabricator.wikimedia.org/T365985)
[11:22:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:23:28] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Adapt gitreview config to new repo name [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1166385 (https://phabricator.wikimedia.org/T365985) (owner: 10Muehlenhoff)
[11:23:50] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on es2044 - https://phabricator.wikimedia.org/T398601#10974891 (10Marostegui) This host is in the same rack as db2213 {T398537} Something with the rack D3?
[11:24:48] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] [BETA CLUSTER] Stop loading VueTest, we're dropping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester)
[11:25:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for osleger [puppet] - 10https://gerrit.wikimedia.org/r/1166386
[11:25:43] <wikibugs>	 (03Merged) 10jenkins-bot: [BETA CLUSTER] Stop loading VueTest, we're dropping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester)
[11:26:34] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] "rebased in deploy host, a full scap won't be needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester)
[11:27:05] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "Will deploy this on Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester)
[11:28:33] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[11:35:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for osleger [puppet] - 10https://gerrit.wikimedia.org/r/1166386 (owner: 10Muehlenhoff)
[11:36:40] <wikibugs>	 (03Abandoned) 10JMeybohm: sre.k8s.pool-depool-cluster: Exclude w[d,c]ws from repooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1165908 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[11:52:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[11:53:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[12:10:19] <wikibugs>	 (03PS2) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[12:10:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[12:11:12] <logmsgbot>	 !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@38ba3ec]: bump section topics to v1.8.0
[12:11:49] <logmsgbot>	 !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@38ba3ec]: bump section topics to v1.8.0 (duration: 00m 49s)
[12:17:13] <wikibugs>	 (03PS1) 10Cathal Mooney: Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390
[12:18:31] <wikibugs>	 (03PS2) 10Cathal Mooney: Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390
[12:19:05] <wikibugs>	 (03PS3) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[12:19:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[12:23:56] <wikibugs>	 (03PS4) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[12:24:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[12:25:01] <wikibugs>	 (03PS1) 10Cathal Mooney: Use IP address not hostname for syslog dest on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1166391 (https://phabricator.wikimedia.org/T398690)
[12:27:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975047 (10elukey) Tried with https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1166378 but then I got:  ` Booting from HTTP Device 1: NIC in Slot 10 Port 1...
[12:28:41] <wikibugs>	 (03PS5) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[12:29:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[12:31:20] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7006.magru.wmnet
[12:31:20] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7006.magru.wmnet
[12:31:55] <vgutierrez>	 !log repool cp7006
[12:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:25] <wikibugs>	 (PS6) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[12:32:56] <wikibugs>	 (CR) CI reject: [V:-1] Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[12:36:03] <wikibugs>	 (PS1) Vgutierrez: hiera: Consolidate eqsin liberica fp settings [puppet] - https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561)
[12:36:25] <wikibugs>	 (PS7) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[12:38:12] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[12:39:55] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:40:17] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:41:47] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.043 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:42:07] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:59] <wikibugs>	 (PS2) Elukey: TEST - fix http_boot_once for reimage [cookbooks] - https://gerrit.wikimedia.org/r/1166378
[12:51:00] <wikibugs>	 (CR) Slyngshede: [C:+1] hiera: Consolidate eqsin liberica fp settings [puppet] - https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[12:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[12:51:41] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[12:59:35] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[13:02:51] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[13:06:21] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Consolidate eqsin liberica fp settings [puppet] - https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[13:07:16] <wikibugs>	 (CR) Ayounsi: [C:+1] Use IP address not hostname for syslog dest on L2 switches [homer/public] - https://gerrit.wikimedia.org/r/1166391 (https://phabricator.wikimedia.org/T398690) (owner: Cathal Mooney)
[13:08:16] <wikibugs>	 (CR) Ayounsi: [C:+1] Use VC status to derive l3_switch variable and remove from YAML [homer/public] - https://gerrit.wikimedia.org/r/1166390 (owner: Cathal Mooney)
[13:08:21] <wikibugs>	 (PS3) Elukey: TEST - fix http_boot_once for reimage [cookbooks] - https://gerrit.wikimedia.org/r/1166378
[13:08:54] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[13:09:45] <wikibugs>	 (CR) Ayounsi: [C:+1] "ship it!" [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[13:15:58] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[13:16:47] <wikibugs>	 SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10975147 (MatthewVernon) Open→Resolved a:MatthewVernon
[13:18:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10975172 (Marostegui) @Jhancock.wm does this have a HW RAID?
[13:19:19] <wikibugs>	 ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975173 (elukey) In reimage we do the following for UEFI HTTP Boot settings: `dhcp_filename = f'http://{apt_ip}/efiboot/snponly.efi'`  So we allow only http:// f...
[13:22:29] <wikibugs>	 ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975187 (elukey) New issue:  `    ┌───────────────────────┤ [!!] Partition disks ├────────────────────────┐        │...
[13:23:48] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] "LGTM, I think I've considered all the various edge cases, and we can safely merge this change and then rollout the hiddenparma change as s" [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[13:32:33] <wikibugs>	 (CR) MVernon: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[13:34:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:36:11] <wikibugs>	 (CR) Ssingh: [C:+2] C:prometheus: dnsbox_service_state_exporter s/define/class [puppet] - https://gerrit.wikimedia.org/r/1166224 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[13:36:30] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-k8s: add monitoring on scheduler not heartbeating [alerts] - https://gerrit.wikimedia.org/r/1166383 (https://phabricator.wikimedia.org/T398420) (owner: Brouberol)
[13:37:10] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10975259 (Marostegui) Any ETA on when could we have these hosts racked and installed?
[13:39:32] <wikibugs>	 (PS8) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[13:43:38] <wikibugs>	 (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[13:46:13] <wikibugs>	 (PS1) Joal: Add data-eng gobblin alert for published files [alerts] - https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665)
[13:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:27] <wikibugs>	 (CR) CI reject: [V:-1] Add data-eng gobblin alert for published files [alerts] - https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) (owner: Joal)
[13:47:59] <wikibugs>	 (PS1) Elukey: preseed: add a new recipe for sretest2006 [puppet] - https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044)
[13:48:26] <wikibugs>	 (PS11) Vgutierrez: cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917)
[13:48:32] <wikibugs>	 (PS2) Ssingh: P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619)
[13:49:44] <wikibugs>	 (CR) Muehlenhoff: preseed: add a new recipe for sretest2006 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) (owner: Elukey)
[13:50:56] <wikibugs>	 (CR) Klausman: "Yes, that is indeed that plan, AIUI." [puppet] - https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[13:50:57] <wikibugs>	 (CR) Elukey: "Hi Matthew!" [puppet] - https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[13:51:52] <wikibugs>	 (PS2) Elukey: preseed: add a new recipe for sretest2006 [puppet] - https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044)
[13:54:07] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) (owner: Elukey)
[13:55:19] <wikibugs>	 (CR) Elukey: [C:+2] preseed: add a new recipe for sretest2006 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) (owner: Elukey)
[13:55:39] <wikibugs>	 (PS3) Ssingh: P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619)
[13:56:01] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[13:56:41] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6145/co" [puppet] - https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[14:01:24] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[14:03:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10975312 (Stevemunene)
[14:04:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10975313 (Stevemunene) Open→Resolved The host is back online
[14:06:32] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1179.eqiad.wmnet with OS bullseye
[14:08:03] <wikibugs>	 (PS1) Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - https://gerrit.wikimedia.org/r/1166403
[14:08:57] <wikibugs>	 (PS2) Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - https://gerrit.wikimedia.org/r/1166403
[14:09:17] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[14:12:21] <vgutierrez>	 !log depooling cp7006 for testing purposes
[14:12:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975330 (elukey) I suspect that the BOSS card's RAID 1 doesn't show up as /dev/sda, but: /dev/nvme0n1p-1  Found this on /var/log/partman (d-i): ` Partitions: #...
[14:17:27] <wikibugs>	 (PS12) Vgutierrez: cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917)
[14:18:13] <wikibugs>	 (CR) CI reject: [V:-1] Netbox: expose the switches a server is connected to [software/spicerack] - https://gerrit.wikimedia.org/r/1166403 (owner: Ayounsi)
[14:18:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) (owner: Vgutierrez)
[14:19:22] <wikibugs>	 SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10975336 (sowmya.guru)
[14:20:39] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[14:20:50] <vgutierrez>	 !log repooling cp7006
[14:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:42] <logmsgbot>	 stevemunene@cumin1002 reimage (PID 994917) is awaiting input
[14:29:01] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[14:32:01] <wikibugs>	 (PS3) Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - https://gerrit.wikimedia.org/r/1166403
[14:33:17] <wikibugs>	 (PS1) Ayounsi: WIP: use Homer to configure the network [cookbooks] - https://gerrit.wikimedia.org/r/1166407
[14:36:20] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye
[14:40:05] <wikibugs>	 (CR) CI reject: [V:-1] WIP: use Homer to configure the network [cookbooks] - https://gerrit.wikimedia.org/r/1166407 (owner: Ayounsi)
[14:40:50] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1179.eqiad.wmnet with OS bullseye
[14:42:40] <wikibugs>	 (CR) CI reject: [V:-1] Netbox: expose the switches a server is connected to [software/spicerack] - https://gerrit.wikimedia.org/r/1166403 (owner: Ayounsi)
[14:43:42] <wikibugs>	 ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975382 (elukey) I tried to reimage with test-cookbook, but after the reboot I got stuck in:  ` Booting from RAID Controller in SL 1: NOSBOOT You have ordered a Dell...
[14:46:10] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye
[14:57:02] <wikibugs>	 (PS2) Joal: Add data-eng gobblin alert for published files [alerts] - https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665)
[15:06:30] <wikibugs>	 (CR) Elukey: "Seems also what they did in https://github.com/dell/iDRAC-Redfish-Scripting/commit/f0f44f653034d6af9c47e0f1fc49b5f7e19b4ad3" [software/spicerack] - https://gerrit.wikimedia.org/r/1166371 (https://phabricator.wikimedia.org/T393044) (owner: Elukey)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:24] <wikibugs>	 (CR) Brouberol: [C:+1] Add data-eng gobblin alert for published files [alerts] - https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) (owner: Joal)
[15:10:14] <wikibugs>	 (CR) Brouberol: [C:+2] Add data-eng gobblin alert for published files [alerts] - https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) (owner: Joal)
[15:11:10] <wikibugs>	 SRE, Data-Platform-SRE (2025.06.13 - 2025.07.04): Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523#10975409 (BTullis)
[15:14:35] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:59] <vgutierrez>	 !log fetch haproxy 2.8.15 on thirdparty/haproxy28 component for bullseye-wikimedia (apt.wm.o)
[15:15:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:18:32] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:28:33] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[15:35:52] <wikibugs>	 ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975444 (elukey) Ok I see, there have been provisioning issues. As far as I can see, the NICs are not found, and it seems the case with the new scp dump code:  ` >>>...
[15:37:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975446 (elukey) I tried to manually copy on apt1002 the `hwraid-1dev.cfg` recipe into another one, to swap /dev/sda with /dev/nvme0n1p-1, but I get the same iss...
[15:39:13] <wikibugs>	 (PS1) Hnowlan: ratelimit: bump version number [docker-images/production-images] - https://gerrit.wikimedia.org/r/1166412 (https://phabricator.wikimedia.org/T388804)
[15:40:40] <wikibugs>	 (PS2) Hnowlan: ratelimit: bump version number [docker-images/production-images] - https://gerrit.wikimedia.org/r/1166412 (https://phabricator.wikimedia.org/T388804)
[15:45:15] <wikibugs>	 ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975452 (elukey) @Volans: summary of the issues for sretest2006:  * http_boot_once in spicerack needs to be fixed to support idrac10+, and https://gerrit.wikimed...
[15:47:03] <wikibugs>	 ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975454 (elukey) @Volans summary of the issues so far:  * Same first two as in T393044#10975451, since it is common to both systems.  * We are not using a BOSS card h...
[16:18:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:28:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[17:34:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:52:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 436156792 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:54:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 141876496 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:55:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 44840 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:56:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:56:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:56:54] <wikibugs>	 (Merged) jenkins-bot: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:56:57] <wikibugs>	 (Merged) jenkins-bot: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:57:20] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1165989|beta: Include allowance for wmcloud.org in wgGraphAllowedDomains (T289318)]], [[gerrit:1165999|beta: Change Beta wikidata canonical to beta.wmcloud.org (T289318)]]
[18:57:23] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[18:59:17] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1165989|beta: Include allowance for wmcloud.org in wgGraphAllowedDomains (T289318)]], [[gerrit:1165999|beta: Change Beta wikidata canonical to beta.wmcloud.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:19:07] <wikibugs>	 (PS3) BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[19:28:33] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[19:41:05] <wikibugs>	 (PS4) BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[20:07:55] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:08:17] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:09:15] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 7.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:09:45] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:26:36] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[20:32:13] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165989|beta: Include allowance for wmcloud.org in wgGraphAllowedDomains (T289318)]], [[gerrit:1165999|beta: Change Beta wikidata canonical to beta.wmcloud.org (T289318)]] (duration: 94m 52s)
[20:32:16] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[20:41:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[20:51:23] <wikibugs>	 (PS1) Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318)
[20:51:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:10:36] <wikibugs>	 (CR) Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[21:17:22] <wikibugs>	 (PS2) Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318)
[21:17:25] <wikibugs>	 (CR) Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[21:20:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[21:21:03] <wikibugs>	 (Merged) jenkins-bot: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[21:21:17] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1166438|beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (T289318)]]
[21:21:20] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[21:23:14] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1166438|beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:33:44] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[21:34:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:39:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:39:29] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166438|beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (T289318)]] (duration: 18m 12s)
[21:39:32] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[21:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:02:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:17:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:28:33] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[23:38:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166453
[23:38:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166453 (owner: TrainBranchBot)
[23:51:35] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166453 (owner: TrainBranchBot)