[00:08:29] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166290
[00:08:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166290 (owner: TrainBranchBot)
[00:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:29:48] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166290 (owner: TrainBranchBot)
[00:46:27] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:46:41] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/8be104b639d7543f5dfd79fda06b957c109c0787a3279d744db013f4c23f0438/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[00:56:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:06:41] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:29:51] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:41] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:34:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:46:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:12:43] PROBLEM - ganeti-noded running on ganeti1032 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[03:13:43] RECOVERY - ganeti-noded running on ganeti1032 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[03:15:37] !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[03:19:20] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [cloudcephosd1042] - vriley@cumin1002"
[03:19:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox:
update mgmt [cloudcephosd1042] - vriley@cumin1002"
[03:19:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:20:51] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1042
[03:21:28] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1042
[03:22:52] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[03:44:30] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:44:54] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:46:33] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:50:06] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:51:33] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:52:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:15:32] vriley@cumin1002 provision (PID 922338) is awaiting input
[04:25:07] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:32:16] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[04:32:30] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[04:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:03] (CR) Ayounsi: [C:+1] C:bird::anycast_healthchecker: notify service on conf file change [puppet] - https://gerrit.wikimedia.org/r/1166238 (owner: Ssingh)
[05:21:18] vriley@cumin1002 reimage (PID 928848) is awaiting input
[05:22:50] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10973902 (ayounsi) There is currently only one switch per rack, so I suggest we only use one uplink for now, and revisit it the day we have more.
[05:25:31] (CR) Ayounsi: Redfish: more tests (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1164473 (owner: Ayounsi)
[05:25:43] (PS4) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[05:25:52] (CR) CI reject: [V:-1] Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473 (owner: Ayounsi)
[05:26:05] (PS5) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[05:32:18] (PS6) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[05:32:30] (CR) Ayounsi: Redfish: more tests (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1164473 (owner: Ayounsi)
[05:34:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:36:27] (CR) Ayounsi: reimage: temporarily store the MAC in Netbox (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[05:46:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T0600)
[06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:12:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3007.wikimedia.org
[06:16:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3007.wikimedia.org
[06:20:30] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973917 (VRiley-WMF)
[06:20:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc1001.eqiad.wmnet
[06:21:59] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973918 (VRiley-WMF) Attempted to image cloudcephosd1042, however it seems to get stuck. Investigating this issue.
[06:24:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to plain
[06:25:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to plain
[06:25:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet
[06:26:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to plain
[06:28:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to plain
[06:29:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6001.wikimedia.org to plain
[06:30:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6001.wikimedia.org to plain
[06:30:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6001.drmrs.wmnet to plain
[06:31:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6001.drmrs.wmnet to plain
[06:32:07] !log failover Ganeti master in drmrs01
to ganeti6003 T382513
[06:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:10] T382513: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513
[06:32:15] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:33:35] PROBLEM - Bird Internet Routing Daemon on doh6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:33:49] PROBLEM - Bird Internet Routing Daemon on durum6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:34:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc1002.eqiad.wmnet
[06:34:35] RECOVERY - Bird Internet Routing Daemon on doh6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:34:49] RECOVERY - Bird Internet Routing Daemon on durum6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:34:49] PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[06:35:13] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:39:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1002.eqiad.wmnet
[06:44:47] SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10973945 (MoritzMuehlenhoff)
[06:57:36] (CR) Filippo Giunchedi: [C:+1] C:prometheus: dnsbox_service_state_exporter s/define/class [puppet] -
https://gerrit.wikimedia.org/r/1166224 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[06:57:46] (CR) Filippo Giunchedi: [C:+1] hiera: dnsbox: set supplementary_groups for anycast-hc [puppet] - https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[06:58:06] (CR) Filippo Giunchedi: [C:+1] bird/anycast-hc: allow setting SupplementaryGroups for anycast-hc unit [puppet] - https://gerrit.wikimedia.org/r/1166222 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T0700)
[07:04:10] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10973970 (ayounsi) →Duplicate dup:T398315
[07:04:11] ops-eqiad, SRE, DC-Ops: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://phabricator.wikimedia.org/T398315#10973972 (ayounsi)
[07:04:49] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10973975 (ayounsi) Closing that task as duplicate of the automatically opened one. If I do the other way around,...
[07:06:00] ops-eqiad, SRE, DC-Ops: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://phabricator.wikimedia.org/T398315#10973978 (ayounsi) More information on {T398442}, probably just need the cable swapped.
@Jclark-ctr @VRiley-WMF
[07:10:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2003.codfw.wmnet
[07:16:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2003.codfw.wmnet
[07:19:44] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6001.drmrs.wmnet with reason: reimage
[07:21:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS bookworm
[07:22:05] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10973983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6001.drmrs.wmnet with OS bookworm
[07:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[07:29:59] (CR) Jelto: [C:+1] "looks mostly good, some comments in-line. I'm still a bit surprised by the complexity of this gerrit failover cookbooks.
I'd hope to reduc" [cookbooks] - https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:32:06] SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671 (MoritzMuehlenhoff) NEW
[07:32:27] (CR) Jelto: "lgtm, thanks for the cleanup" [puppet] - https://gerrit.wikimedia.org/r/1165874 (owner: Muehlenhoff)
[07:32:31] (CR) Jelto: [C:+1] Remove now unused Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1165874 (owner: Muehlenhoff)
[07:35:25] SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974006 (MoritzMuehlenhoff)
[07:36:15] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2019.codfw.wmnet
[07:39:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6001.drmrs.wmnet with reason: host reimage
[07:40:20] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1166262 (https://phabricator.wikimedia.org/T397591) (owner: BryanDavis)
[07:40:39] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) (owner: BryanDavis)
[07:41:19] SRE, collaboration-services, Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10974020 (ABran-WMF) Open→In progress p:Triage→High neat, thanks! I've sent you a draft document to review, I'll put it on wikitech once...
[07:41:54] (CR) Muehlenhoff: [C:+2] Remove now unused Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1165874 (owner: Muehlenhoff)
[07:42:42] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[07:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:42:56] !log depooling cp7006 for testing purposes
[07:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6001.drmrs.wmnet with reason: host reimage
[07:48:16] jmm@cumin1003 decommission (PID 224703) is awaiting input
[07:52:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:53:03] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2019.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:53:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2019.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:53:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:53:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2019.codfw.wmnet
[07:53:29] SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974081 (ops-monitoring-bot) cookbooks.sre.hosts.decommission
executed by jmm@cumin1003 for hosts: `ganeti2019.codfw.wmnet` - ganeti2019.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertm...
[07:53:51] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2020.codfw.wmnet
[07:56:27] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: testing
[07:58:19] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:03:13] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:03:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6001.drmrs.wmnet with OS bookworm
[08:03:34] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10974096 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6001.drmrs.wmnet with OS bookworm completed: - ganeti6001 (**PASS*...
[08:04:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:04:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:04:13] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2020.codfw.wmnet
[08:04:22] SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974097 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `ganeti2020.codfw.wmnet` - ganeti2020.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertm...
[08:06:20] (CR) Arnaudb: "I hope to reduce complexity over time as well! It is also why this cookbook has been decoupled from the "main one". I agree with you that " [cookbooks] - https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:07:03] (PS1) Muehlenhoff: Remove ganeti2019/ganeti2020 from site.pp [puppet] - https://gerrit.wikimedia.org/r/1166344 (https://phabricator.wikimedia.org/T398671)
[08:07:22] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10974101 (MoritzMuehlenhoff)
[08:08:07] SRE, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974105 (MoritzMuehlenhoff)
[08:08:12] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10974107 (MoritzMuehlenhoff)
[08:08:32] (CR) Muehlenhoff: [C:+2] Remove ganeti2019/ganeti2020 from site.pp [puppet] - https://gerrit.wikimedia.org/r/1166344 (https://phabricator.wikimedia.org/T398671) (owner: Muehlenhoff)
[08:08:54] (PS1) Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301)
[08:10:53] (CR) Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[08:11:14] (CR) CI reject: [V:-1] statistics: Add Python script for model uploading to statistics machines.
[puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[08:14:57] (CR) Jelto: [C:+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.0 [puppet] - https://gerrit.wikimedia.org/r/1166142 (https://phabricator.wikimedia.org/T394382) (owner: Jelto)
[08:16:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet
[08:21:51] SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10974133 (MatthewVernon) I think this may have been a caching issue - if I copy your wikitext into the Sandbox, I get what look to me to be correct thumbnails; and indeed the permalink to the cs....
[08:22:15] (PS5) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530)
[08:22:45] (CR) Cathal Mooney: "Thanks for the review, updated patchset addresses the comments I think."
[homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:25:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[08:32:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6001.drmrs.wmnet to cluster drmrs01 and group B12
[08:33:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6001.drmrs.wmnet to cluster drmrs01 and group B12
[08:33:41] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10974158 (MoritzMuehlenhoff)
[08:33:51] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10974159 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done
[08:37:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet
[08:47:08] RECOVERY - very high load average likely xfs on ms-be2077 is OK: OK - load average: 8.58, 2.04, 0.68 https://wikitech.wikimedia.org/wiki/Swift
[08:47:15] (CR) Cathal Mooney: [C:+1] C:bird::anycast_healthchecker: notify service on conf file change [puppet] - https://gerrit.wikimedia.org/r/1166238 (owner: Ssingh)
[08:48:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to drbd
[08:48:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet
[08:48:52] (CR) Clément Goubert: [C:+1] zarcillo: Update egress to idp.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[08:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[08:52:48] SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10974204 (matej_suchanek) Yes, everything looks correct now. But is the problem really resolved? The article where I noticed the problem was created [[ https://cs.wikipedia.org/w/index.php?title...
[08:55:42] FIRING: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:43] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10974229 (Clement_Goubert) Open→In progress p:Triage→Medium a:Clement_Goubert
[08:56:05] SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10974233 (MatthewVernon) What I assume happened is that some of the thumbnails from the first upload didn't get overwritten when the second upload was made (I purged the image this morning, which...
[08:58:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to drbd
[08:58:22] PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:58:41] RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 87.49 ms
[08:59:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to drbd
[08:59:32] (PS2) Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301)
[09:00:42] RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:01:46] (CR) CI reject: [V:-1] statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[09:06:20] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10974246 (cmooney) p:High→Medium This has been stable since the optics were replaced yesterday. I will review again next week a...
[09:07:03] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backupmon1001.eqiad.wmnet with reason: Maintenance and reboot
[09:08:12] FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:10:23] (PS1) Joal: Fix user and user_old views for WMCS [puppet] - https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602)
[09:10:45] SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10974254 (MoritzMuehlenhoff)
[09:16:02] (CR) Btullis: Fix user and user_old views for WMCS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602) (owner: Joal)
[09:16:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to drbd
[09:16:33] PROBLEM - Host install6002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:47] RECOVERY - Host install6002 is UP: PING OK - Packet loss = 0%, RTA = 87.54 ms
[09:17:14] (PS1) Joal: Update analytics sqoop script tables [puppet] - https://gerrit.wikimedia.org/r/1166347 (https://phabricator.wikimedia.org/T398602)
[09:18:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:19:28] (PS2) Joal: Fix user and user_old views for WMCS [puppet] - https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602)
[09:19:33] (CR) Btullis: [C:+2] Update analytics sqoop script tables [puppet] -
10https://gerrit.wikimedia.org/r/1166347 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal) [09:19:35] (03CR) 10Btullis: [V:03+2 C:03+2] Update analytics sqoop script tables [puppet] - 10https://gerrit.wikimedia.org/r/1166347 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal) [09:21:12] (03PS3) 10Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) [09:21:48] (03CR) 10Btullis: Fix user and user_old views for WMCS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal) [09:21:50] (03CR) 10Btullis: [C:03+2] Fix user and user_old views for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1166346 (https://phabricator.wikimedia.org/T398602) (owner: 10Joal) [09:22:23] (03PS1) 10Brouberol: airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683) [09:22:24] (03PS1) 10Brouberol: airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683) [09:22:25] (03PS1) 10Brouberol: airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683) [09:22:27] (03PS1) 10Brouberol: airflow-ml: enable interactions with the production analytics cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) [09:23:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10974347 (10Clement_Goubert) Please make sure the [[ https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_account | Wikimedia global a... 
[09:28:15] 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686 (10Tobi_WMDE_SW) 03NEW [09:33:17] 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974387 (10Clement_Goubert) [09:33:29] 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974388 (10Clement_Goubert) [09:34:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:37:00] 06SRE, 10SRE-Access-Requests, 10wikitech.wikimedia.org: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974404 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium @Tobi_WMDE_SW Can you or @sowmya.guru fill out the first part of the... 
[09:37:11] (03CR) 10Btullis: [C:03+1] airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [09:42:22] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974422 (10taavi) [09:46:14] (03CR) 10Hashar: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [09:46:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:52] 06SRE, 06Infrastructure-Foundations, 10netops: DNS resolution not working on Juniper virtual-chassis switches eqiad - https://phabricator.wikimedia.org/T398690 (10cmooney) 03NEW p:05Triage→03Medium [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:56] (03CR) 10Hashar: [C:03+2] gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:01:35] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backupmon1001.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002 [10:01:40] (03CR) 10Btullis: [C:03+1] airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:01:52] (03CR) 10Btullis: [C:03+1] airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:02:07] (03CR) 10Btullis: [C:03+1] airflow-ml: enable interactions with the production analytics cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:02:19] (03Merged) 10jenkins-bot: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [10:02:22] (03CR) 10Brouberol: [C:03+2] airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:02:25] (03CR) 10Brouberol: [C:03+2] airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:02:27] (03CR) 10Brouberol: [C:03+2] airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:02:30] (03CR) 10Brouberol: [C:03+2] airflow-ml: enable interactions with the production analytics cluster 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:03:20] (03CR) 10Btullis: "I think that the values-analytics-production.yaml symlink also needs to be added." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:03:34] (03PS1) 10Elukey: redfish: add support for iDRAC 10 to force_http_boot_once [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166371 (https://phabricator.wikimedia.org/T393044) [10:04:24] (03Merged) 10jenkins-bot: airflow: enable the hadoop-shell to reach out to the hive metastore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166349 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:04:38] (03Merged) 10jenkins-bot: airflow: enable Kerberos security in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166350 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:04:49] (03Merged) 10jenkins-bot: airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166351 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:04:51] (03Merged) 10jenkins-bot: airflow-ml: enable interactions with the production analytics cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166352 (https://phabricator.wikimedia.org/T398683) (owner: 10Brouberol) [10:08:29] (03PS1) 10Brouberol: airflow-ml: add values-analytics-production symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166372 [10:08:59] (03CR) 10Btullis: [C:03+1] airflow-ml: add values-analytics-production symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166372 (owner: 10Brouberol) [10:11:09] (03CR) 10Jobo: [C:03+2] data.yaml: Allow tailing of spiderpig jobrunner and apiserver journals [puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [10:12:01] (03CR) 
10Brouberol: [C:03+2] airflow-ml: add values-analytics-production symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166372 (owner: 10Brouberol) [10:13:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [10:13:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [10:14:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10974520 (10Marostegui) @VRiley-WMF is it a completely new server right? With a different asset tag? If that's the case, I think we should treat it as such and just give it a new hostname: db1259 and f... [10:15:35] (03PS10) 10Vgutierrez: cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [10:18:12] !log cgoubert@deploy1003 Locking from deployment [ALL REPOSITORIES]: Dragonfly supernodes reboot [10:18:55] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet [10:19:00] (03PS1) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [10:19:05] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on es2044 - https://phabricator.wikimedia.org/T398601#10974527 (10Marostegui) p:05Triage→03Medium [10:19:07] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10974529 (10elukey) I have a patch for spicerack that I need to test, I found a workaround but I'd like to make sure it works :) [10:19:49] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10974532 (10elukey) I am going to test the above patch in T393044, and I'll report back (same issues, new Dells have an idrac version with a 
lot of not documented change... [10:22:28] (03CR) 10Vgutierrez: pyrra: remove multi-dc for istio-based SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [10:22:43] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet [10:23:16] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet [10:26:28] (03CR) 10Elukey: [C:04-1] "This needs some extra work, since we are using trafficserver_backend_requests_seconds_count that is available to all DCs. I think we shoul" [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [10:26:59] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet [10:27:19] !log cgoubert@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Dragonfly supernodes reboot (duration: 09m 07s) [10:31:04] (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [10:31:30] (03PS2) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [10:35:21] (03PS3) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [10:38:36] (03PS1) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) [10:38:45] (03PS1) 10Elukey: TEST - fix http_boot_once for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1166378 [10:41:22] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm 
[10:43:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2203,2212].codfw.wmnet with reason: Maintenance [10:43:51] (03CR) 10Jcrespo: "Should I add devtmpfs ? so we only monitor / and /srv ?" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:45:03] (03CR) 10Marostegui: "+1 to exclude it too. Thanks for working on this." [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:46:04] (03CR) 10Jcrespo: "Feel free to answer/amend/merge it yourselves, leaving the decision to the dbas." [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:46:13] (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [10:46:28] (03CR) 10Jcrespo: "Ok, amending." 
[puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:48:22] (03PS8) 10Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) [10:48:24] (03PS2) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) [10:49:11] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:49:53] (03CR) 10Jcrespo: [C:03+1] mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:50:06] (03PS4) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [10:51:04] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm [10:51:43] (03PS1) 10Marostegui: installserver: Remove db2244 [puppet] - 10https://gerrit.wikimedia.org/r/1166381 [10:52:28] 06SRE, 06DBA, 06serviceops, 05MW-1.44-notes, and 2 others: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10974717 (10Ladsgroup) 05Open→03Resolved I filed {T398693} and {T398692} as follow ups. Closing this. Feel free to add more follow... 
[10:53:11] (03PS3) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) [10:53:22] (03CR) 10Jcrespo: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:53:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [10:53:42] (03CR) 10CI reject: [V:04-1] mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:54:14] (03PS4) 10Jcrespo: mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) [10:54:22] (03CR) 10Jcrespo: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:55:20] (03CR) 10Michael Große: [C:03+1] "As far as I can tell, wmf.8 has successfully been deployed to all wikis. 
That means this change (and the one preceeding it) should be read" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [10:56:09] (03CR) 10Marostegui: [C:03+2] installserver: Remove db2244 [puppet] - 10https://gerrit.wikimedia.org/r/1166381 (owner: 10Marostegui) [10:56:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 32 hosts with reason: maintenance [10:57:15] (03CR) 10Muehlenhoff: New structure for sshd_config starting with trixie (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [10:57:35] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/output/1166377/4452/" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [10:57:43] (03CR) 10Muehlenhoff: "All feedback has been integrated, should be ready for a fresh review" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [10:58:19] (03CR) 10Vgutierrez: "In https://wikitech.wikimedia.org/wiki/SLO/WDQS#Service_Level_Indicators_(SLIs) it seems like they are focusing on availability (status co" [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [10:59:39] (03CR) 10Marostegui: [C:03+1] "Thank you for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T0700) [11:00:05] jelto, arnoldokoth, and mutante: How many deployers does it take to do GitLab version upgrades deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250704T1100). 
[11:00:08] (03CR) 10Cathal Mooney: [C:03+2] New function to generate device-specific IBGP data from cluster YAML [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [11:01:37] (03CR) 10Jelto: [C:03+1] "thanks for the answers, looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [11:01:43] (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [11:03:01] (03CR) 10Jcrespo: [C:03+1] "I will NOT merge it myself. if it was me, I would wait until monday and test it gradually to prevent hundreds of pages. Letting you do it " [puppet] - 10https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: 10Jcrespo) [11:04:21] (03PS5) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:05:42] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v10.0.2 with ibgp function in plugin - cmooney@cumin1003 [11:08:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v10.0.2 with ibgp function in plugin - cmooney@cumin1003 [11:09:11] (03CR) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [11:09:46] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10974830 (10Marostegui) This host is in the same rack as es2044 {T398601} Something with the rack D3? 
[11:10:09] (03PS1) 10Brouberol: airflow-k8s: add monitoring on scheduler not heartbeating [alerts] - 10https://gerrit.wikimedia.org/r/1166383 (https://phabricator.wikimedia.org/T398420) [11:12:19] (03CR) 10Btullis: [C:03+1] "Nice." [alerts] - 10https://gerrit.wikimedia.org/r/1166383 (https://phabricator.wikimedia.org/T398420) (owner: 10Brouberol) [11:12:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:15:52] (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [11:19:28] (03PS1) 10Muehlenhoff: Adapt gitreview config to new repo name [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1166385 (https://phabricator.wikimedia.org/T365985) [11:22:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:23:28] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Adapt gitreview config to new repo name [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1166385 (https://phabricator.wikimedia.org/T365985) (owner: 10Muehlenhoff) [11:23:50] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on es2044 - https://phabricator.wikimedia.org/T398601#10974891 (10Marostegui) This host is in the same rack as db2213 {T398537} Something with the rack D3? 
[11:24:48] (03CR) 10Ladsgroup: [C:03+2] [BETA CLUSTER] Stop loading VueTest, we're dropping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [11:25:32] (03PS1) 10Muehlenhoff: Record LDAP access for osleger [puppet] - 10https://gerrit.wikimedia.org/r/1166386 [11:25:43] (03Merged) 10jenkins-bot: [BETA CLUSTER] Stop loading VueTest, we're dropping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [11:26:34] (03CR) 10Ladsgroup: [C:03+2] "rebased in deploy host, a full scap won't be needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [11:27:05] (03CR) 10Ladsgroup: [C:03+1] "Will deploy this on Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [11:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [11:35:55] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for osleger [puppet] - 10https://gerrit.wikimedia.org/r/1166386 (owner: 10Muehlenhoff) [11:36:40] (03Abandoned) 10JMeybohm: sre.k8s.pool-depool-cluster: Exclude w[d,c]ws from repooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1165908 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:52:33] (03PS1) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [11:53:03] (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 
(https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:10:19] (03PS2) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [12:10:49] (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:11:12] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@38ba3ec]: bump section topics to v1.8.0 [12:11:49] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@38ba3ec]: bump section topics to v1.8.0 (duration: 00m 49s) [12:17:13] (03PS1) 10Cathal Mooney: Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390 [12:18:31] (03PS2) 10Cathal Mooney: Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390 [12:19:05] (03PS3) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [12:19:36] (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:23:56] (03PS4) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [12:24:26] (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:25:01] (03PS1) 10Cathal Mooney: Use IP address not hostname for syslog dest on L2 switches [homer/public] - 
10https://gerrit.wikimedia.org/r/1166391 (https://phabricator.wikimedia.org/T398690) [12:27:24] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975047 (10elukey) Tried with https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1166378 but then I got: ` Booting from HTTP Device 1: NIC in Slot 10 Port 1... [12:28:41] (03PS5) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [12:29:11] (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:31:20] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7006.magru.wmnet [12:31:20] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7006.magru.wmnet [12:31:55] !log repool cp7006 [12:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:25] (03PS6) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [12:32:56] (03CR) 10CI reject: [V:04-1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:36:03] (03PS1) 10Vgutierrez: hiera: Consolidate eqsin liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561) [12:36:25] (03PS7) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [12:38:12] (03CR) 10Vgutierrez: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [12:39:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:40:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:47] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.043 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:42:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:59] (03PS2) 10Elukey: TEST - fix http_boot_once for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1166378 [12:51:00] (03CR) 10Slyngshede: [C:03+1] hiera: Consolidate eqsin liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [12:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [12:51:41] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [12:59:35] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm [13:02:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:06:21] (03CR) 10Vgutierrez: [C:03+2] 
hiera: Consolidate eqsin liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1166393 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:07:16] (03CR) 10Ayounsi: [C:03+1] Use IP address not hostname for syslog dest on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1166391 (https://phabricator.wikimedia.org/T398690) (owner: 10Cathal Mooney) [13:08:16] (03CR) 10Ayounsi: [C:03+1] Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390 (owner: 10Cathal Mooney) [13:08:21] (03PS3) 10Elukey: TEST - fix http_boot_once for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1166378 [13:08:54] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [13:09:45] (03CR) 10Ayounsi: [C:03+1] "ship it!" [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:15:58] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm [13:16:47] 10SRE-swift-storage, 10Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660#10975147 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [13:18:57] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10975172 (10Marostegui) @Jhancock.wm does this have a HW RAID? [13:19:19] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975173 (10elukey) In reimage we do the following for UEFI HTTP Boot settings: `dhcp_filename = f'http://{apt_ip}/efiboot/snponly.efi'` So we allow only http:// f... 
[13:22:29] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975187 (10elukey) New issue: ` ┌───────────────────────┤ [!!] Partition disks ├────────────────────────┐ │... [13:23:48] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM, I think I've considered all the various edge cases, and we can safely merge this change and then rollout the hiddenparma change as s" [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:32:33] (03CR) 10MVernon: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [13:34:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:36:11] (03CR) 10Ssingh: [C:03+2] C:prometheus: dnsbox_service_state_exporter s/define/class [puppet] - 10https://gerrit.wikimedia.org/r/1166224 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:36:30] (03CR) 10Brouberol: [C:03+2] airflow-k8s: add monitoring on scheduler not heartbeating [alerts] - 10https://gerrit.wikimedia.org/r/1166383 (https://phabricator.wikimedia.org/T398420) (owner: 10Brouberol) [13:37:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10975259 (10Marostegui) Any ETA on when could we have these hosts racked and installed? 
[13:39:32] (03PS8) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [13:43:38] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [13:46:13] (03PS1) 10Joal: Add data-eng gobblin alert for published files [alerts] - 10https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) [13:46:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:27] (03CR) 10CI reject: [V:04-1] Add data-eng gobblin alert for published files [alerts] - 10https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) (owner: 10Joal) [13:47:59] (03PS1) 10Elukey: preseed: add a new recipe for sretest2006 [puppet] - 10https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) [13:48:26] (03PS11) 10Vgutierrez: cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) [13:48:32] (03PS2) 10Ssingh: P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619) [13:49:44] (03CR) 10Muehlenhoff: preseed: add a new recipe for sretest2006 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [13:50:56] (03CR) 10Klausman: "Yes, that is indeed that plan, AIUI." [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [13:50:57] (03CR) 10Elukey: "Hi Matthew!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [13:51:52] (03PS2) 10Elukey: preseed: add a new recipe for sretest2006 [puppet] - 10https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) [13:54:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [13:55:19] (03CR) 10Elukey: [C:03+2] preseed: add a new recipe for sretest2006 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166401 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [13:55:39] (03PS3) 10Ssingh: P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619) [13:56:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:56:41] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6145/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:01:24] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [14:03:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10975312 (10Stevemunene) [14:04:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10975313 (10Stevemunene) 05Open→03Resolved The host is back online [14:06:32] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for 
host an-worker1179.eqiad.wmnet with OS bullseye [14:08:03] (03PS1) 10Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 [14:08:57] (03PS2) 10Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 [14:09:17] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm [14:12:21] !log depooling cp7006 for testing purposes [14:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:48] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975330 (10elukey) I suspect that the BOSS card's RAID 1 doesn't show up as /dev/sda, but: /dev/nvme0n1p-1 Found this on /var/log/partman (d-i): ` Partitions: #... [14:17:27] (03PS12) 10Vgutierrez: cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) [14:18:13] (03CR) 10CI reject: [V:04-1] Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi) [14:18:28] (03CR) 10Giuseppe Lavagetto: [C:03+1] cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [14:19:22] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10975336 (10sowmya.guru) [14:20:39] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [14:20:50] !log repooling cp7006 [14:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:42] stevemunene@cumin1002 reimage (PID 994917) is awaiting input [14:29:01] !log 
elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm [14:32:01] (03PS3) 10Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 [14:33:17] (03PS1) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [14:36:20] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [14:40:05] (03CR) 10CI reject: [V:04-1] WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [14:40:50] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1179.eqiad.wmnet with OS bullseye [14:42:40] (03CR) 10CI reject: [V:04-1] Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi) [14:43:42] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975382 (10elukey) I tried to reimage with test-cookbook, but after the reboot I got stuck in: ` Booting from RAID Controller in SL 1: NOSBOOT You have ordered a Dell... 
[14:46:10] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye [14:57:02] (03PS2) 10Joal: Add data-eng gobblin alert for published files [alerts] - 10https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) [15:06:30] (03CR) 10Elukey: "Seems also what they did in https://github.com/dell/iDRAC-Redfish-Scripting/commit/f0f44f653034d6af9c47e0f1fc49b5f7e19b4ad3" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166371 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:24] (03CR) 10Brouberol: [C:03+1] Add data-eng gobblin alert for published files [alerts] - 10https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) (owner: 10Joal) [15:10:14] (03CR) 10Brouberol: [C:03+2] Add data-eng gobblin alert for published files [alerts] - 10https://gerrit.wikimedia.org/r/1166400 (https://phabricator.wikimedia.org/T370665) (owner: 10Joal) [15:11:10] 06SRE, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523#10975409 (10BTullis) [15:14:35] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:59] !log fetch haproxy 2.8.15 on thirdparty/haproxy28 component for bullseye-wikimedia (apt.wm.o) [15:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:42] 
RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:32] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:35:52] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975444 (10elukey) Ok I see, there have been provisioning issues. As far as I can see, the NICs are not found, and it seems the case with the new scp dump code: ` >>>... [15:37:50] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975446 (10elukey) I tried to manually copy on apt1002 the `hwraid-1dev.cfg` recipe into another one, to swap /dev/sda with /dev/nvme0n1p-1, but I get the same iss... 
[15:39:13] (03PS1) 10Hnowlan: ratelimit: bump version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1166412 (https://phabricator.wikimedia.org/T388804) [15:40:40] (03PS2) 10Hnowlan: ratelimit: bump version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1166412 (https://phabricator.wikimedia.org/T388804) [15:45:15] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10975452 (10elukey) @Volans: summary of the issues for sretest2006: * http_boot_once in spicerack needs to be fixed to support idrac10+, and https://gerrit.wikimed... [15:47:03] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975454 (10elukey) @Volans summary of the issues so far: * Same first two as in T393044#10975451, since it is common to both systems. * We are not using a BOSS card h... 
[16:18:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:28:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [17:34:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:52:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 436156792 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[18:54:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 141876496 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:55:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 44840 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:56:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [18:56:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [18:56:54] (03Merged) 10jenkins-bot: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [18:56:57] (03Merged) 10jenkins-bot: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [18:57:20] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1165989|beta: Include allowance for wmcloud.org in wgGraphAllowedDomains (T289318)]], [[gerrit:1165999|beta: Change Beta wikidata canonical to beta.wmcloud.org (T289318)]] [18:57:23] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [18:59:17] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1165989|beta: Include allowance for wmcloud.org in wgGraphAllowedDomains (T289318)]], [[gerrit:1165999|beta: Change Beta wikidata canonical to beta.wmcloud.org (T289318)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:19:07] (03PS3) 10BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936) [19:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [19:41:05] (03PS4) 10BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936) [20:07:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:08:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:09:15] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 7.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:09:45] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:26:36] !log krinkle@deploy1003 krinkle: Continuing with sync [20:32:13] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165989|beta: Include allowance for wmcloud.org in wgGraphAllowedDomains (T289318)]], [[gerrit:1165999|beta: Change Beta wikidata canonical to beta.wmcloud.org (T289318)]] (duration: 94m 52s) [20:32:16] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [20:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [20:51:23] (03PS1) 10Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) [20:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:10:36] (03CR) 10Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [21:17:22] (03PS2) 10Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) [21:17:25] (03CR) 10Krinkle: beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [21:20:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [21:21:03] (03Merged) 10jenkins-bot: beta: Change loginwiki/metawiki/auth canonical to 
beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166438 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [21:21:17] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1166438|beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (T289318)]] [21:21:20] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [21:23:14] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1166438|beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:33:44] !log krinkle@deploy1003 krinkle: Continuing with sync [21:34:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:29] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166438|beta: Change loginwiki/metawiki/auth canonical to beta.wmcloud.org (T289318)]] (duration: 18m 12s) [21:39:32] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [21:46:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:02:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:17:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [23:38:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1166453 [23:38:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1166453 (owner: 10TrainBranchBot) [23:51:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1166453 (owner: 10TrainBranchBot)