[00:03:34] <swfrench-wmf>	 !log reprepro include php-apcu_5.1.24-1+wmf11u1 in component/php83 - T398245
[00:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:37] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[00:08:37] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166049
[00:08:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166049 (owner: TrainBranchBot)
[00:08:41] <swfrench-wmf>	 !log reprepro include php-igbinary_3.2.16-4+wmf11u1 in component/php83 - T398245
[00:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:44] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[00:09:40] <swfrench-wmf>	 !log reprepro include php-msgpack_3.0.0-1+wmf11u1 in component/php83 - T398245
[00:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:36] <wikibugs>	 (PS1) DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1166051
[00:21:13] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1166051 (owner: DDesouza)
[00:22:42] <wikibugs>	 (CR) Dzahn: [C:+2] add initial blubber .pipeline config and a README [container/codesearch] - https://gerrit.wikimedia.org/r/1166044 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:23:11] <wikibugs>	 (Merged) jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1166051 (owner: DDesouza)
[00:25:26] <wikibugs>	 (CR) Dzahn: [V:+2 C:+2] add initial blubber .pipeline config and a README [container/codesearch] - https://gerrit.wikimedia.org/r/1166044 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:29:01] <wikibugs>	 (PS1) Dzahn: add blubber skeleton config, use base image nodejs [container/codesearch] - https://gerrit.wikimedia.org/r/1166052 (https://phabricator.wikimedia.org/T268199)
[00:32:01] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166049 (owner: TrainBranchBot)
[00:35:39] <wikibugs>	 (CR) Dzahn: [V:+2 C:+2] add blubber skeleton config, use base image nodejs [container/codesearch] - https://gerrit.wikimedia.org/r/1166052 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:36:28] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:37:21] <wikibugs>	 SRE, Data-Platform-SRE: Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523#10970526 (Scott_French) Open→Resolved a:RKemper It has been over 1h since https://gerrit.wikimedia.org/r/1166016 was merged, and subsequent puppet runs on the prometh...
[00:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[01:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:27:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:38] <wikibugs>	 (PS1) DDesouza: miscweb(design-strategy): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166059 (https://phabricator.wikimedia.org/T344471)
[01:49:41] <wikibugs>	 (PS1) DDesouza: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166060 (https://phabricator.wikimedia.org/T219903)
[01:51:35] <wikibugs>	 (PS1) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[01:51:45] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166060 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[01:51:46] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(design-strategy): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166059 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[01:53:47] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:53:53] <wikibugs>	 (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[01:53:56] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:53:58] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:54:10] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:54:11] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:54:17] <wikibugs>	 (Merged) jenkins-bot: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166060 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[01:54:18] <wikibugs>	 (Merged) jenkins-bot: miscweb(design-strategy): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166059 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[01:54:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:54:25] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:54:48] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:54:50] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:54:51] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:54:53] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:54:54] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:54:56] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:55:21] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:55:34] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:55:35] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:55:47] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:55:48] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:56:05] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:56:10] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:56:22] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:56:23] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:56:39] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:56:40] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:56:58] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:57:14] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:58:10] <wikibugs>	 (PS2) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[02:00:27] <wikibugs>	 (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[02:03:07] <wikibugs>	 (PS3) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[02:05:27] <wikibugs>	 (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[02:06:19] <wikibugs>	 (PS4) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[02:08:26] <wikibugs>	 (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[02:15:16] <wikibugs>	 (PS5) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[02:19:26] <wikibugs>	 (CR) Andrew Bogott: "Raymond -- I'm not 100% convinced that this is better, it makes this more resilient for the particular dumb issue that I caused but may no" [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[03:06:53] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[03:18:42] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[03:22:24] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[03:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[03:38:22] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[04:29:29] <wikibugs>	 SRE, Data-Engineering: Include accept-language header in turnilo/superset - https://phabricator.wikimedia.org/T398213#10970650 (Joe) Open→Resolved Thanks @JAllemandou @BTullis for the assistance!
[04:39:28] <icinga-wm>	 PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[04:42:28] <icinga-wm>	 RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.008 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[04:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:21:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:22:51] <wikibugs>	 (PS1) MusikAnimal: codeFolding: fix folding <ref> [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430)
[05:23:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430) (owner: MusikAnimal)
[05:50:04] <wikibugs>	 (PS5) Abijeet Patro: CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513)
[05:56:52] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:57:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600).
[06:24:30] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM. I am starting to think that maybe we should make a Cmnd_Alias for SpiderPig, but that's not supported in our Puppet code at the mome" [puppet] - https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[06:31:20] <wikibugs>	 (CR) Elukey: [C:+1] kubernetes: improve naming [software/debmonitor] - https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:31:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:34:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to drbd
[06:36:03] <wikibugs>	 SRE, Editing-team, Fundraising-Backlog, Traffic-Icebox, and 5 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085#10970742 (Wellverywell) On what exactly is this task stalled? AFAICS it was planned to be done in early 2020?
[06:38:00] <wikibugs>	 (PS1) Elukey: pyrra: fix success ration config for Istio SLOs [puppet] - https://gerrit.wikimedia.org/r/1166068
[06:40:10] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:41:23] <wikibugs>	 (CR) Elukey: [C:+2] pyrra: fix success ration config for Istio SLOs [puppet] - https://gerrit.wikimedia.org/r/1166068 (owner: Elukey)
[06:41:45] <wikibugs>	 (PS2) Dzahn: microsites: refactor blackbox checks to use resource defaults [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[06:43:03] <wikibugs>	 (CR) Dzahn: "Hey, thank you for this! sorry for the delay. I was out of office and just saw this today. (maybe also because it was marked as WIP) and n" [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[06:45:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to drbd
[06:45:32] <icinga-wm>	 PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:40] <icinga-wm>	 RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 87.57 ms
[06:47:03] <wikibugs>	 (PS1) Elukey: pyrra: fix success-ratio's default regex [puppet] - https://gerrit.wikimedia.org/r/1166070
[06:47:32] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:47:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to drbd
[06:49:10] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:49:32] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:49:34] <wikibugs>	 (CR) Elukey: [C:+2] pyrra: fix success-ratio's default regex [puppet] - https://gerrit.wikimedia.org/r/1166070 (owner: Elukey)
[06:49:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet
[06:49:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10970754 (ops-monitoring-bot) Draining ganeti6003.drmrs.wmnet of running VMs
[06:50:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet
[06:50:41] <logmsgbot>	 jmm@cumin2002 changedisk (PID 158663) is awaiting input
[06:56:10] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:56:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:10] <wikibugs>	 (CR) Volans: [C:+2] kubernetes: improve naming [software/debmonitor] - https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:57:15] <wikibugs>	 (PS1) Elukey: pyrra: fix-2 for the slo success-ratio's regex [puppet] - https://gerrit.wikimedia.org/r/1166071
[06:57:36] <wikibugs>	 (CR) Volans: [C:+2] postinst: clear stale files [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:58:00] <wikibugs>	 (Merged) jenkins-bot: kubernetes: improve naming [software/debmonitor] - https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:58:21] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6128/co" [puppet] - https://gerrit.wikimedia.org/r/1166071 (owner: Elukey)
[06:58:30] <wikibugs>	 (PS1) Jelto: microsites::monitoring: update body_regex_matches [puppet] - https://gerrit.wikimedia.org/r/1166073 (https://phabricator.wikimedia.org/T398528)
[06:58:32] <wikibugs>	 (Merged) jenkins-bot: postinst: clear stale files [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[07:00:04] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] pyrra: fix-2 for the slo success-ratio's regex [puppet] - https://gerrit.wikimedia.org/r/1166071 (owner: Elukey)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0700).
[07:00:05] <jouncebot>	 musikanimal: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:59] <musikanimal>	 o/
[07:01:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to drbd
[07:01:46] <icinga-wm>	 PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:01:50] <icinga-wm>	 RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 87.47 ms
[07:02:59] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on prometheus6002.drmrs.wmnet with reason: switch disk type back to DRBD
[07:03:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to drbd
[07:03:48] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[07:04:48] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[07:05:01] <musikanimal>	 I'm hoping someone is on call today. My patch is small, but somewhat critical to get deployed before wmf.8 lands on group2 tomorrow
[07:05:10] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:05:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:07:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Lumen (2001:1900:2100::a99) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:08:10] <Amir1>	 musikanimal: can you self serve?
[07:09:34] <musikanimal>	 I am not trained to do deploys, no
[07:10:06] <wikibugs>	 (CR) Ladsgroup: [C:+2] codeFolding: fix folding <ref> [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430) (owner: MusikAnimal)
[07:11:14] <wikibugs>	 (Merged) jenkins-bot: codeFolding: fix folding <ref> [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430) (owner: MusikAnimal)
[07:12:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/4 (Transit: Lumen (442550278) {#1503}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:14:04] <wikibugs>	 SRE-SLO: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534 (elukey) NEW p:Triage→High
[07:14:23] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1166067|codeFolding: fix folding <ref> (T398430)]]
[07:14:26] <stashbot>	 T398430: Cursor does not update upon move when nesting unclosed tags in CM6 - https://phabricator.wikimedia.org/T398430
[07:15:27] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:15:39] <wikibugs>	 (CR) Jelto: [C:+2] microsites::monitoring: update body_regex_matches [puppet] - https://gerrit.wikimedia.org/r/1166073 (https://phabricator.wikimedia.org/T398528) (owner: Jelto)
[07:16:44] <logmsgbot>	 !log ladsgroup@deploy1003 musikanimal, ladsgroup: Backport for [[gerrit:1166067|codeFolding: fix folding <ref> (T398430)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:16:53] <wikibugs>	 (PS1) Muehlenhoff: Unvendor jquery [software/debmonitor] - https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696)
[07:16:54] <Amir1>	 ^
[07:16:57] <Amir1>	 please verify
[07:17:08] <musikanimal>	 doing!
[07:17:23] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2318.mgmt:22 - https://phabricator.wikimedia.org/T398536 (phaultfinder) NEW
[07:17:35] <wikibugs>	 ops-codfw, DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537 (FCeratto-WMF) NEW
[07:18:28] <vgutierrez>	 !log depooling cp7006 for requestctl debugging
[07:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:29] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for thanos-fe2007.mgmt:22 - https://phabricator.wikimedia.org/T398538 (phaultfinder) NEW
[07:18:30] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for db2152.mgmt:22 - https://phabricator.wikimedia.org/T398539 (phaultfinder) NEW
[07:18:31] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2200.mgmt:22 - https://phabricator.wikimedia.org/T398540 (phaultfinder) NEW
[07:18:32] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for cirrussearch2115.mgmt:22 - https://phabricator.wikimedia.org/T398541 (phaultfinder) NEW
[07:18:33] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for arclamp2001.mgmt:22 - https://phabricator.wikimedia.org/T398543 (phaultfinder) NEW
[07:18:34] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for gerrit2002.mgmt:22 - https://phabricator.wikimedia.org/T398542 (phaultfinder) NEW
[07:18:37] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for conf2006.mgmt:22 - https://phabricator.wikimedia.org/T398546 (phaultfinder) NEW
[07:18:41] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for gerrit2003.mgmt:22 - https://phabricator.wikimedia.org/T398544 (phaultfinder) NEW
[07:18:45] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2192.mgmt:22 - https://phabricator.wikimedia.org/T398547 (phaultfinder) NEW
[07:18:50] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2220.mgmt:22 - https://phabricator.wikimedia.org/T398545 (phaultfinder) NEW
[07:18:54] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2193.mgmt:22 - https://phabricator.wikimedia.org/T398550 (phaultfinder) NEW
[07:18:58] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2216.mgmt:22 - https://phabricator.wikimedia.org/T398548 (phaultfinder) NEW
[07:19:02] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2226.mgmt:22 - https://phabricator.wikimedia.org/T398551 (phaultfinder) NEW
[07:19:06] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2319.mgmt:22 - https://phabricator.wikimedia.org/T398549 (phaultfinder) NEW
[07:19:10] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for mc-misc2002.mgmt:22 - https://phabricator.wikimedia.org/T398552 (phaultfinder) NEW
[07:19:14] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2218.mgmt:22 - https://phabricator.wikimedia.org/T398554 (phaultfinder) NEW
[07:19:18] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for maps2008.mgmt:22 - https://phabricator.wikimedia.org/T398553 (phaultfinder) NEW
[07:19:22] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for restbase2038.mgmt:22 - https://phabricator.wikimedia.org/T398555 (phaultfinder) NEW
[07:19:28] <wikibugs>	 ops-codfw, DBA, DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971019 (FCeratto-WMF)
[07:19:36] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2219.mgmt:22 - https://phabricator.wikimedia.org/T398556 (phaultfinder) NEW
[07:19:44] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for bast2003.mgmt:22 - https://phabricator.wikimedia.org/T398557 (phaultfinder) NEW
[07:19:48] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2329.mgmt:22 - https://phabricator.wikimedia.org/T398559 (phaultfinder) NEW
[07:19:50] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[07:19:52] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2223.mgmt:22 - https://phabricator.wikimedia.org/T398558 (phaultfinder) NEW
[07:19:56] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for pc2016.mgmt:22 - https://phabricator.wikimedia.org/T398560 (phaultfinder) NEW
[07:20:00] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2201.mgmt:22 - https://phabricator.wikimedia.org/T398561 (phaultfinder) NEW
[07:20:04] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2320.mgmt:22 - https://phabricator.wikimedia.org/T398562 (phaultfinder) NEW
[07:20:08] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2330.mgmt:22 - https://phabricator.wikimedia.org/T398563 (phaultfinder) NEW
[07:20:12] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2217.mgmt:22 - https://phabricator.wikimedia.org/T398564 (phaultfinder) NEW
[07:20:16] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for db2173.mgmt:22 - https://phabricator.wikimedia.org/T398565 (phaultfinder) NEW
[07:20:20] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for puppetdb2003.mgmt:22 - https://phabricator.wikimedia.org/T398567 (phaultfinder) NEW
[07:20:21] <Amir1>	 XioNoX: topranks 
[07:20:24] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for wikikube-worker2225.mgmt:22 - https://phabricator.wikimedia.org/T398566 (10phaultfinder) 03NEW
[07:20:28] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2182.mgmt:22 - https://phabricator.wikimedia.org/T398568 (10phaultfinder) 03NEW
[07:20:32] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for es2044.mgmt:22 - https://phabricator.wikimedia.org/T398569 (10phaultfinder) 03NEW
[07:20:36] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for es2040.mgmt:22 - https://phabricator.wikimedia.org/T398570 (10phaultfinder) 03NEW
[07:20:40] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2213.mgmt:22 - https://phabricator.wikimedia.org/T398571 (10phaultfinder) 03NEW
[07:20:44] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for aux-k8s-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T398572 (10phaultfinder) 03NEW
[07:20:50] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 16 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[07:20:55] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for wikikube-worker2227.mgmt:22 - https://phabricator.wikimedia.org/T398574 (10phaultfinder) 03NEW
[07:20:59] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2181.mgmt:22 - https://phabricator.wikimedia.org/T398573 (10phaultfinder) 03NEW
[07:21:03] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, one question inline" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[07:21:04] <musikanimal>	 Amir1: OK confirmed! looks good :)
[07:21:07] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2214.mgmt:22 - https://phabricator.wikimedia.org/T398575 (10phaultfinder) 03NEW
[07:21:10] <logmsgbot>	 !log ladsgroup@deploy1003 musikanimal, ladsgroup: Continuing with sync
[07:21:11] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for restbase2035.mgmt:22 - https://phabricator.wikimedia.org/T398576 (10phaultfinder) 03NEW
[07:21:18] <Amir1>	 moving forward
[07:21:19] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for wikikube-worker2222.mgmt:22 - https://phabricator.wikimedia.org/T398577 (10phaultfinder) 03NEW
[07:21:20] <musikanimal>	 thank you sooo much!
[07:21:25] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for puppetserver2004.mgmt:22 - https://phabricator.wikimedia.org/T398578 (10phaultfinder) 03NEW
[07:21:29] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for thanos-be2005.mgmt:22 - https://phabricator.wikimedia.org/T398579 (10phaultfinder) 03NEW
[07:21:33] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for wikikube-worker2186.mgmt:22 - https://phabricator.wikimedia.org/T398582 (10phaultfinder) 03NEW
[07:21:37] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for wikikube-worker2221.mgmt:22 - https://phabricator.wikimedia.org/T398581 (10phaultfinder) 03NEW
[07:21:41] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2219.mgmt:22 - https://phabricator.wikimedia.org/T398580 (10phaultfinder) 03NEW
[07:21:45] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for es2048.mgmt:22 - https://phabricator.wikimedia.org/T398583 (10phaultfinder) 03NEW
[07:21:49] <wikibugs>	 (03PS1) 10Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534)
[07:21:53] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for wikikube-worker2198.mgmt:22 - https://phabricator.wikimedia.org/T398584 (10phaultfinder) 03NEW
[07:21:57] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2174.mgmt:22 - https://phabricator.wikimedia.org/T398585 (10phaultfinder) 03NEW
[07:22:01] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2220.mgmt:22 - https://phabricator.wikimedia.org/T398587 (10phaultfinder) 03NEW
[07:22:05] <wikibugs>	 10ops-codfw, 06DC-Ops: Unresponsive management for db2195.mgmt:22 - https://phabricator.wikimedia.org/T398586 (10phaultfinder) 03NEW
[07:22:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pyrra: remove multi-dc for istio-based SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[07:22:18] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6129/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[07:22:47] <Amir1>	 yw ^_^
[07:22:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/4 (Transit: Lumen (442550278) {#1503}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:23:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10971251 (10Ladsgroup) >>! In T393296#10970263, @VRiley-WMF wrote: > We have received the Seed Server for this unit. Would we like to use a new/different name but set it up in the same location?   Manu...
[07:25:50] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[07:26:30] <wikibugs>	 (03PS2) 10Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534)
[07:26:39] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166067|codeFolding: fix folding <ref> (T398430)]] (duration: 12m 16s)
[07:26:42] <stashbot>	 T398430: Cursor does not update upon move when nesting unclosed tags in CM6 - https://phabricator.wikimedia.org/T398430
[07:27:03] <Amir1>	 musikanimal: fully deployed
[07:27:18] <musikanimal>	 you're the best! and I mean that
[07:27:25] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6130/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[07:27:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Lumen (2001:1900:2100::a99) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:27:39] <wikibugs>	 (03CR) 10Muehlenhoff: Unvendor jquery (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[07:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[07:28:44] <Amir1>	 <3
[07:29:05] <wikibugs>	 (03CR) 10Arnaudb: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[07:29:50] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 16 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[07:30:06] <wikibugs>	 (03CR) 10Arnaudb: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[07:30:58] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 10MW-1.45-notes (1.45.0-wmf.8; 2025-07-01): Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10971315 (10elukey) a:05DLynch→03elukey Thanks, I see the metrics now in Prometheu...
[07:34:46] <effie>	 !log upload php-excimer_1.2.5-1+wmf11u1
[07:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:41] <wikibugs>	 (03CR) 10Volans: [C:03+1] "ship it" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[07:37:25] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add makefile to manage the repo basic tasks [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166078
[07:37:25] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Code changes: * Search for reason in actions [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166079
[07:37:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add makefile to manage the repo basic tasks [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166078 (owner: 10Giuseppe Lavagetto)
[07:37:50] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Code changes: * Search for reason in actions [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166079 (owner: 10Giuseppe Lavagetto)
[07:38:35] <wikibugs>	 (03PS1) 10Jelto: miscweb: bump all remaining miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303)
[07:38:40] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Feature: search in response reasons - oblivian@cumin1003"
[07:38:41] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: search in response reasons - oblivian@cumin1003
[07:39:11] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: search in response reasons - oblivian@cumin1003
[07:39:12] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Feature: search in response reasons - oblivian@cumin1003"
[07:39:25] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[07:41:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Unvendor jquery [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[07:41:17] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto)
[07:42:05] <effie>	 !jouncebot now
[07:42:05] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[07:42:12] <effie>	 jouncebot: now
[07:42:13] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0700)
[07:42:16] <effie>	 jouncebot: next 
[07:42:17] <jouncebot>	 In 0 hour(s) and 17 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0800)
[07:42:28] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: haproxy testing
[07:44:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Depend on jquery and setup symlinks to the paths used by Debian [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166081
[07:44:51] <wikibugs>	 (03CR) 10Jelto: [C:03+2] miscweb: bump all remaining miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto)
[07:45:49] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166081 (owner: 10Muehlenhoff)
[07:46:26] <wikibugs>	 (03PS1) 10Federico Ceratto: CAS: Add wmf group for Zarcillo, remove ops [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304)
[07:46:27] <wikibugs>	 (03CR) 10Federico Ceratto: "This CR reintroduces CR 1161873 as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto)
[07:46:48] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: bump all remaining miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto)
[07:47:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:49:38] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[07:50:08] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[07:51:54] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[07:52:14] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:52:21] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[07:52:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repool pc4 T378715', diff saved to https://phabricator.wikimedia.org/P78744 and previous config saved to /var/cache/conftool/dbconfig/20250703-075225-ladsgroup.json
[07:52:28] <stashbot>	 T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715
[07:53:05] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[07:53:35] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[07:53:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Depend on jquery and setup symlinks to the paths used by Debian [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166081 (owner: 10Muehlenhoff)
[07:55:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a lintian override for a false positive around the use of Bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166123 (https://phabricator.wikimedia.org/T397696)
[07:56:44] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166123 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[07:57:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add a lintian override for a false positive around the use of Bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166123 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[08:00:05] <jouncebot>	 jnuche and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0800).
[08:01:00] <jnuche>	 morning, train is happening in 5m
[08:03:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to drbd
[08:03:42] <icinga-wm>	 PROBLEM - Host prometheus6002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:34] <icinga-wm>	 RECOVERY - Host prometheus6002 is UP: PING OK - Packet loss = 0%, RTA = 87.50 ms
[08:05:43] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166129 (https://phabricator.wikimedia.org/T392178)
[08:05:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166129 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot)
[08:06:34] <wikibugs>	 (03PS1) 10Volans: Upstream release v0.6.5 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166130
[08:06:37] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166129 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot)
[08:06:44] <wikibugs>	 (03CR) 10Volans: [C:03+2] Upstream release v0.6.5 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166130 (owner: 10Volans)
[08:07:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to plain
[08:07:36] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v0.6.5 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166130 (owner: 10Volans)
[08:08:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to plain
[08:08:58] <wikibugs>	 (03Abandoned) 10Majavah: P:toolforge::grid: disable webservicemonitor [puppet] - 10https://gerrit.wikimedia.org/r/888347 (https://phabricator.wikimedia.org/T329467) (owner: 10Majavah)
[08:10:27] <jinxer-wm>	 RESOLVED: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:13:03] <volans>	 !log uploaded debmonitor-server,python3-debmonitor_0.6.5 to apt.wikimedia.org bookworm-wikimedia
[08:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:56] <wikibugs>	 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588 (10Shisma) 03NEW
[08:14:36] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.8  refs T392178
[08:14:39] <stashbot>	 T392178: 1.45.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T392178
[08:15:34] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[08:18:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6001.wikimedia.org to plain
[08:20:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6001.wikimedia.org to plain
[08:21:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6001.drmrs.wmnet to plain
[08:22:10] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:23:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6001.drmrs.wmnet to plain
[08:23:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to plain
[08:25:24] <wikibugs>	 (03PS1) 10Volans: kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696)
[08:25:32] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971395 (10Ladsgroup) This is master of s5 in codfw. @FCeratto-WMF Can you do a switchover of s5 in codfw ASAP? If this goes down, we lose the whole section in codfw.
[08:25:46] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:26:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to plain
[08:26:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast6003.wikimedia.org to plain
[08:27:10] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:27:48] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:28:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast6003.wikimedia.org to plain
[08:29:29] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6003.drmrs.wmnet with reason: reimage
[08:32:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6003.drmrs.wmnet with OS bookworm
[08:32:11] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10971418 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6003.drmrs.wmnet with OS bookworm
[08:35:56] <wikibugs>	 (03CR) 10Ladsgroup: CX: Add virtual-cx-shared DatabaseVirtualDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro)
[08:36:36] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[08:37:13] <moritzm>	 FYI, ml-etcd1002 and dse-k8s-etcd1002 will briefly go down for a Ganeti node reboot
[08:37:22] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet
[08:39:03] <wikibugs>	 (03PS1) 10Zabe: Use correct index on categorylinks [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166133 (https://phabricator.wikimedia.org/T385890)
[08:39:20] <icinga-wm>	 PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:50] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:40:26] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[08:40:30] <icinga-wm>	 RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms
[08:42:35] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet
[08:42:43] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet
[08:48:40] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet
[08:50:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6003.drmrs.wmnet with reason: host reimage
[08:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[08:53:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6003.drmrs.wmnet with reason: host reimage
[08:53:16] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Looks good! I left just very minor optional things, no need to re-review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon)
[08:54:18] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet
[08:57:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, a few questions below:" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[08:58:40] <wikibugs>	 (03PS6) 10Urbanecm: [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515)
[08:58:45] <wikibugs>	 (03PS3) 10Urbanecm: [Growth] Remove feature flags related to Surfacing Structured Tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515)
[08:59:25] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Switch lvs5006 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561)
[09:01:39] <wikibugs>	 (03PS1) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534)
[09:02:36] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[09:02:38] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6131/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[09:04:53] <wikibugs>	 (03CR) 10Volans: "LGTM but I have a question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi)
[09:06:43] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1210 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166138 (https://phabricator.wikimedia.org/T398593)
[09:06:48] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1166139 (https://phabricator.wikimedia.org/T398593)
[09:08:08] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[09:08:24] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166140 (https://phabricator.wikimedia.org/T398594)
[09:09:57] <wikibugs>	 (03PS3) 10Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534)
[09:09:57] <wikibugs>	 (03PS2) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534)
[09:09:57] <wikibugs>	 (03PS1) 10Elukey: pyrra: fix k8s cluster name for the revertrisk SLO [puppet] - 10https://gerrit.wikimedia.org/r/1166141
[09:10:53] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6132/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[09:11:49] <wikibugs>	 (03CR) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[09:13:08] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:13:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6003.drmrs.wmnet with OS bookworm
[09:13:16] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10971538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6003.drmrs.wmnet with OS bookworm completed: - ganeti6003 (**PASS*...
[09:13:22] <icinga-wm>	 PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:14:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:14:39] <wikibugs>	 (03CR) 10Elukey: [C:03+2] pyrra: fix k8s cluster name for the revertrisk SLO [puppet] - 10https://gerrit.wikimedia.org/r/1166141 (owner: 10Elukey)
[09:15:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:15:31] <wikibugs>	 (03CR) 10Volans: [C:03+1] "post-merge optional suggestion" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[09:16:49] <wikibugs>	 (03CR) 10A smart kitten: ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798) (owner: 10Arlolra)
[09:17:25] <wikibugs>	 (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.0 [puppet] - 10https://gerrit.wikimedia.org/r/1166142 (https://phabricator.wikimedia.org/T394382)
[09:17:55] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971548 (10FCeratto-WMF) p:05Triage→03High a:03FCeratto-WMF
[09:18:26] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971551 (10FCeratto-WMF)
[09:19:18] <wikibugs>	 (03CR) 10Volans: [C:03+1] reimage: add dhcp MAC address support for physical hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[09:19:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:21:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T398593
[09:21:07] <stashbot>	 T398593: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T398593
[09:21:57] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1004.eqiad.wmnet with reason: Maintenance and reboot
[09:22:44] <effie>	 jouncebot:  now
[09:22:44] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0800)
[09:23:05] <effie>	 Amir1: you did a backport just now, and the train is done, to my understanding?
[09:23:15] <wikibugs>	 (03CR) 10Volans: "minor comment inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[09:24:18] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T398594
[09:24:21] <stashbot>	 T398594: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T398594
[09:25:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2192 from API/vslow/dump T398594', diff saved to https://phabricator.wikimedia.org/P78745 and previous config saved to /var/cache/conftool/dbconfig/20250703-092522-fceratto.json
[09:27:08] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:27:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:27:26] <wikibugs>	 (03PS2) 10Majavah: natlog: Add explicit dependency to file_line [puppet] - 10https://gerrit.wikimedia.org/r/1165921 (https://phabricator.wikimedia.org/T273734)
[09:27:26] <wikibugs>	 (03PS1) 10Majavah: hieradata: Enable hourly logrotate on codfw cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1166145 (https://phabricator.wikimedia.org/T273734)
[09:27:29] <wikibugs>	 (03PS1) 10Majavah: hieradata: Enable hourly logrotate in all cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1166146 (https://phabricator.wikimedia.org/T273734)
[09:27:37] <Amir1>	 I did the backport long time ago, I don't know about the train xD
[09:28:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:29:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:30:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:30:21] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7006.magru.wmnet
[09:30:21] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7006.magru.wmnet
[09:30:33] <wikibugs>	 (03CR) 10Majavah: [C:03+2] natlog: Add explicit dependency to file_line [puppet] - 10https://gerrit.wikimedia.org/r/1165921 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[09:30:44] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Enable hourly logrotate on codfw cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1166145 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[09:31:07] <vgutierrez>	 !log repooling cp7006
[09:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:24] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166140 (https://phabricator.wikimedia.org/T398594) (owner: 10Gerrit maintenance bot)
[09:32:30] <federico3>	 taavi: you might see a pending puppet change from me that needs merging
[09:32:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598 (10cmooney) 03NEW p:05Triage→03High
[09:32:50] <federico3>	 ok I was able to merge it after you
[09:32:53] <taavi>	 federico3: that didn't come up for me, up to you to merge
[09:33:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:34:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:43] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10971628 (10cmooney)
[09:34:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2186.mgmt:22 - https://phabricator.wikimedia.org/T398582#10971629 (10cmooney)
[09:34:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2221.mgmt:22 - https://phabricator.wikimedia.org/T398581#10971630 (10cmooney)
[09:34:48] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2219.mgmt:22 - https://phabricator.wikimedia.org/T398580#10971631 (10cmooney)
[09:34:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for thanos-be2005.mgmt:22 - https://phabricator.wikimedia.org/T398579#10971632 (10cmooney)
[09:34:52] <federico3>	 !log Starting s5 codfw failover from db2213 to db2192 - T398594
[09:34:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2181.mgmt:22 - https://phabricator.wikimedia.org/T398573#10971633 (10cmooney)
[09:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:55] <stashbot>	 T398594: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T398594
[09:34:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aux-k8s-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T398572#10971634 (10cmooney)
[09:35:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2213.mgmt:22 - https://phabricator.wikimedia.org/T398571#10971635 (10cmooney)
[09:35:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2040.mgmt:22 - https://phabricator.wikimedia.org/T398570#10971636 (10cmooney)
[09:35:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2044.mgmt:22 - https://phabricator.wikimedia.org/T398569#10971637 (10cmooney)
[09:35:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2182.mgmt:22 - https://phabricator.wikimedia.org/T398568#10971638 (10cmooney)
[09:35:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for puppetdb2003.mgmt:22 - https://phabricator.wikimedia.org/T398567#10971639 (10cmooney)
[09:35:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2173.mgmt:22 - https://phabricator.wikimedia.org/T398565#10971640 (10cmooney)
[09:35:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2217.mgmt:22 - https://phabricator.wikimedia.org/T398564#10971641 (10cmooney)
[09:35:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2320.mgmt:22 - https://phabricator.wikimedia.org/T398562#10971643 (10cmooney)
[09:35:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2330.mgmt:22 - https://phabricator.wikimedia.org/T398563#10971642 (10cmooney)
[09:35:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for pc2016.mgmt:22 - https://phabricator.wikimedia.org/T398560#10971645 (10cmooney)
[09:35:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2201.mgmt:22 - https://phabricator.wikimedia.org/T398561#10971644 (10cmooney)
[09:35:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2329.mgmt:22 - https://phabricator.wikimedia.org/T398559#10971646 (10cmooney)
[09:35:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for bast2003.mgmt:22 - https://phabricator.wikimedia.org/T398557#10971647 (10cmooney)
[09:35:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2219.mgmt:22 - https://phabricator.wikimedia.org/T398556#10971649 (10cmooney)
[09:36:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2192 to s5 primary T398594', diff saved to https://phabricator.wikimedia.org/P78746 and previous config saved to /var/cache/conftool/dbconfig/20250703-093612-fceratto.json
[09:38:05] <wikibugs>	 (03PS1) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534)
[09:39:01] <wikibugs>	 (03CR) 10Volans: "LGTM, very minor suggestions inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway)
[09:39:03] <wikibugs>	 (03PS2) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534)
[09:39:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2213 weights T398594', diff saved to https://phabricator.wikimedia.org/P78747 and previous config saved to /var/cache/conftool/dbconfig/20250703-093943-fceratto.json
[09:40:01] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6134/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[09:40:18] <wikibugs>	 (03CR) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[09:40:26] <Amir1>	 federico3: don't switchover eqiad please, that requires read only down time and it's not needed
[09:40:31] <wikibugs>	 (03CR) 10Volans: reimage: add support for using the host UUID for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[09:40:56] <federico3>	 Amir1: yes, I'm not touching eq, only cod
[09:41:22] <Amir1>	 should I close T398593 then?
[09:41:22] <stashbot>	 T398593: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T398593
[09:42:54] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971674 (10FCeratto-WMF) db2213 has been flipped, now it's a candidate master
[09:45:15] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971693 (10Ladsgroup) Thanks! I'd wait for dc ops. it's a bit high prio since it's candidate master but less prio because it's not a master now
[09:48:08] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: PSU issue on es2044 - https://phabricator.wikimedia.org/T398601 (10FCeratto-WMF) 03NEW
[09:50:25] <wikibugs>	 (03CR) 10Volans: "Great to see some additional tests, thanks! Few typos and a simplification suggestion inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi)
[09:51:14] <wikibugs>	 (03Abandoned) 10Ladsgroup: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1166139 (https://phabricator.wikimedia.org/T398593) (owner: 10Gerrit maintenance bot)
[09:51:34] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db1210 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166138 (https://phabricator.wikimedia.org/T398593) (owner: 10Gerrit maintenance bot)
[09:59:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:04] <jouncebot>	 effie: How many deployers does it take to do MediaWiki infrastructure (UTC mid-day) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1000).
[10:00:21] <volans>	 !log upgrading production debmonitor-server to the latest v0.6.5
[10:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:33] <effie>	 1 is the loneliest number 
[10:01:43] <wikibugs>	 (03CR) 10Volans: [C:03+2] debmonitor: add link to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[10:01:55] <wikibugs>	 (03CR) 10Volans: [C:03+2] debmonitor: use the new endpoint for the check [puppet] - 10https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[10:03:02] <icinga-wm>	 PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Debmonitor
[10:03:22] <volans>	 downtime didn't arrive in time, sorry
[10:03:32] <jinxer-wm>	 FIRING: ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:04:35] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:05:42] <logmsgbot>	 !log volans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on debmonitor2003.codfw.wmnet,debmonitor1003.eqiad.wmnet,debmonitor-dev2001.codfw.wmnet with reason: deploy new version
[10:07:19] <wikibugs>	 (03PS1) 10Effie Mouzeli: php8.1: rebuild images to pick up excimer 1.2.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165594 (https://phabricator.wikimedia.org/T397907) (owner: 10Scott French)
[10:07:27] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] php8.1: rebuild images to pick up excimer 1.2.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165594 (https://phabricator.wikimedia.org/T397907) (owner: 10Scott French)
[10:07:51] <wikibugs>	 (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] php8.1: rebuild images to pick up excimer 1.2.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165594 (https://phabricator.wikimedia.org/T397907) (owner: 10Scott French)
[10:08:24] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver2001.codfw.wmnet
[10:08:29] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:12:11] <wikibugs>	 (03PS1) 10Samtar: InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978)
[10:12:13] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2001.codfw.wmnet
[10:16:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10971808 (10Volans) Totally agree there is no point. For the `idrac` one the only potential use case would be to match it with our existing `mgmt` b...
[10:17:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10971813 (10Volans) Option 2 LGTM too
[10:20:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10971830 (10cmooney) >>! In T398464#10971808, @Volans wrote: > Totally agree there is no point. For the `idrac` one the only potential use case woul...
[10:20:50] <wikibugs>	 (03PS1) 10Clément Goubert: team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156
[10:21:51] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156 (owner: 10Clément Goubert)
[10:22:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10971847 (10cmooney)
[10:23:36] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1005.eqiad.wmnet with reason: Maintenance and reboot
[10:23:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10971849 (10Volans) Ack, let's do both: disable it in the bios and skip it in the import
[10:24:02] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156 (owner: 10Clément Goubert)
[10:25:11] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156 (owner: 10Clément Goubert)
[10:26:14] <wikibugs>	 (03PS1) 10Volans: debmonitor: use the new endpoint for checks [puppet] - 10https://gerrit.wikimedia.org/r/1166158 (https://phabricator.wikimedia.org/T397696)
[10:26:48] <logmsgbot>	 !log jiji@deploy1003 Started scap sync-world:  T397907 - Upgrade Excimer to 1.2.5 in production
[10:26:52] <stashbot>	 T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907
[10:27:13] <wikibugs>	 06SRE, 10decommission-hardware, 06Infrastructure-Foundations: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607 (10MoritzMuehlenhoff) 03NEW
[10:29:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetserver2003 from serving requests [dns] - 10https://gerrit.wikimedia.org/r/1166159 (https://phabricator.wikimedia.org/T398607)
[10:33:03] <icinga-wm>	 RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor2003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 578 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[10:37:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1166158 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[10:37:16] <wikibugs>	 (03CR) 10Volans: [C:03+2] debmonitor: use the new endpoint for checks [puppet] - 10https://gerrit.wikimedia.org/r/1166158 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[10:42:17] <logmsgbot>	 !log jiji@deploy1003 Stopping before sync operations
[10:42:29] <wikibugs>	 06SRE, 07SRE-Unowned, 10Deployments, 06Release-Engineering-Team: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10971949 (10hnowlan) In recent weeks I've noticed more and more tendency to say "let's move to -sre" when an incident begins or lots of coordin...
[10:43:33] <logmsgbot>	 !log jiji@deploy1003 Locking from deployment [ALL REPOSITORIES]: T397907 - Upgrade Excimer to 1.2.5 in production in progress, blocking deploys
[10:43:36] <stashbot>	 T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907
[10:44:01] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetserver2003 from active Puppet servers [puppet] - 10https://gerrit.wikimedia.org/r/1166160 (https://phabricator.wikimedia.org/T398607)
[10:44:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetserver role frm puppetserver2003 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607)
[10:44:23] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[10:44:41] <wikibugs>	 (03Abandoned) 10Hnowlan: admin_ng: increase limits for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160767 (owner: 10Hnowlan)
[10:45:04] <wikibugs>	 (03Abandoned) 10Hnowlan: Revert "changeprop: Remove rules related to parsoid (RB sunset)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159535 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan)
[10:47:38] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:48:55] <wikibugs>	 (03CR) 10Samwilson: [C:03+1] InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar)
[10:49:51] <effie>	 !log starting staged rollout of Excimer to 1.2.5, mw-debug first, mw-api-int next
[10:49:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:26] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[10:51:59] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:54:19] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:54:35] <TheresNoTime>	 jouncebot: nowandnext
[10:54:36] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1000)
[10:54:36] <jouncebot>	 In 1 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1200)
[10:55:18] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar)
[10:57:05] <effie>	 TheresNoTime: I am running a deployment that will take quite a long time 
[10:57:38] <TheresNoTime>	 effie: no worries, I've scheduled what I was going to do for an actual backport window :)
[10:57:46] <effie>	 cool! 
[10:59:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612 (10cmooney) 03NEW p:05Triage→03High
[11:01:09] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[11:03:32] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:03:40] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:04:12] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:04:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet
[11:05:23] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[11:05:30] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[11:05:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10972002 (10Clement_Goubert) 05In progress→03Resolved a:03Clement_...
[11:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:06:30] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:07:43] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[11:10:54] <logmsgbot>	 cmooney@cumin1003 netbox (PID 101345) is awaiting input
[11:11:53] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for rgw.codfw.dpe.anycast.wmnet - cmooney@cumin1003"
[11:11:58] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for rgw.codfw.dpe.anycast.wmnet - cmooney@cumin1003"
[11:11:58] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:12:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet
[11:13:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] service: add discovery active/active config [puppet] - 10https://gerrit.wikimedia.org/r/1164458 (https://phabricator.wikimedia.org/T397618) (owner: 10Hnowlan)
[11:15:08] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:15:48] <effie>	 !log starting staged rollout of Excimer to 1.2.5, mw-api-ext
[11:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:54] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[11:16:20] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:17:35] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1004.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002
[11:18:08] <wikibugs>	 (03PS1) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[11:18:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[11:20:08] <wikibugs>	 (03PS2) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[11:21:04] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[11:21:10] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[11:21:54] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[11:24:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10972037 (10Stevemunene) 05Open→03Resolved >>! In T390176#10967599, @Jclark-ctr wrote: > @...
[11:24:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10972040 (10Stevemunene)
[11:25:03] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[11:26:36] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[11:26:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613 (10Stevemunene) 03NEW
[11:27:49] <logmsgbot>	 !log jiji@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: T397907 - Upgrade Excimer to 1.2.5 in production in progress, blocking deploys (duration: 44m 16s)
[11:27:52] <stashbot>	 T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907
[11:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[11:29:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6003.drmrs.wmnet to cluster drmrs01 and group B12
[11:30:00] <logmsgbot>	 !log jiji@deploy1003 Started scap sync-world: T397907 - Upgrade Excimer to 1.2.5 in production
[11:31:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:32:46] <logmsgbot>	 jmm@cumin2002 addnode (PID 260760) is awaiting input
[11:33:49] <wikibugs>	 (03PS3) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[11:35:01] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove jquery [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168
[11:35:06] <logmsgbot>	 !log jiji@deploy1003 Finished scap sync-world: T397907 - Upgrade Excimer to 1.2.5 in production (duration: 06m 59s)
[11:35:09] <stashbot>	 T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907
[11:36:14] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10972104 (10MoritzMuehlenhoff)
[11:36:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[11:37:12] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1005.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002
[11:37:22] <wikibugs>	 (03CR) 10Muehlenhoff: kubernetes: show also the image OS (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[11:37:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6003.drmrs.wmnet to cluster drmrs01 and group B12
[11:38:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to drbd
[11:38:48] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove jquery [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168
[11:40:31] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168 (owner: 10Muehlenhoff)
[11:41:02] <wikibugs>	 (03PS2) 10Volans: kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696)
[11:41:05] <wikibugs>	 (03CR) 10Volans: kubernetes: show also the image OS (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[11:41:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[11:41:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove jquery [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168 (owner: 10Muehlenhoff)
[11:45:09] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[11:45:52] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[11:48:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job squid in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:51:35] <wikibugs>	 (03PS3) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534)
[11:53:48] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.0 [puppet] - 10https://gerrit.wikimedia.org/r/1166142 (https://phabricator.wikimedia.org/T394382) (owner: 10Jelto)
[11:55:57] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:56:12] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:56:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to drbd
[11:56:33] <icinga-wm>	 PROBLEM - Host install6002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:56:39] <icinga-wm>	 RECOVERY - Host install6002 is UP: PING OK - Packet loss = 0%, RTA = 87.38 ms
[11:57:40] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: set an-worker1176 and 1179 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1166170 (https://phabricator.wikimedia.org/T398027)
[11:58:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job squid in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1200)
[12:11:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137) (owner: 10EggRoll97)
[12:13:38] <wikibugs>	 (03PS1) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464)
[12:13:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:13:53] <wikibugs>	 (03PS2) 10EggRoll97: Allow abusefilter block action on plwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137)
[12:13:55] <wikibugs>	 (03PS2) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464)
[12:14:43] <wikibugs>	 (03PS1) 10Kosta Harlan: special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167)
[12:15:14] <kostajh>	 jnuche: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1166178 should fix the logspam related to T398167
[12:15:14] <stashbot>	 T398167: MediaWiki\Exception\UserNotLoggedIn: Please log in to be able to access this page or action. - https://phabricator.wikimedia.org/T398167
[12:15:25] <kostajh>	 are you able to deploy it?
[12:15:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[12:16:28] <jnuche>	 kostajh: yeah, but I'd feel more comfortable if someone else can take a look at it before we deploy it
[12:18:05] <kostajh>	 jnuche: I can have a look once it's staged on mwdebug
[12:18:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:19:06] <wikibugs>	 (03PS3) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464)
[12:19:25] <jnuche>	 kostajh: that sounds good, also I see that's already the cherry pick
[12:19:34] <jnuche>	 I'm ok with deploying
[12:19:41] <jnuche>	 let's do it
[12:20:15] <kostajh>	 cool
[12:20:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[12:20:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: 10Kosta Harlan)
[12:22:30] <wikibugs>	 (03CR) 10Volans: [C:03+2] kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[12:23:33] <wikibugs>	 (03Merged) 10jenkins-bot: kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[12:23:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:27:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: 10Kosta Harlan)
[12:27:36] <wikibugs>	 (03PS1) 10Volans: Revert "Remove jquery" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166185
[12:28:38] <jnuche>	 kostajh: patch got a test failure
[12:28:43] <wikibugs>	 (03PS1) 10Mvolz: Update parameter name for wikidata/citoid integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576)
[12:29:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:29:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update parameter name for wikidata/citoid integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[12:29:33] <wikibugs>	 (03PS2) 10Mvolz: Update parameter name for wikidata/citoid integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576)
[12:29:41] <kostajh>	 jnuche: hmm
[12:29:44] <wikibugs>	 (03PS1) 10Volans: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696)
[12:30:00] <wikibugs>	 (03CR) 10Volans: [C:03+2] Revert "Remove jquery" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166185 (owner: 10Volans)
[12:30:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update parameter name for wikidata/citoid integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[12:30:23] <kostajh>	 jnuche: that is unrelated, seems like a flaky test
[12:30:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Remove jquery" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166185 (owner: 10Volans)
[12:32:02] <wikibugs>	 (03CR) 10Jaime Nuche: "recheck" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: 10Kosta Harlan)
[12:32:29] <wikibugs>	 (03PS1) 10Volans: JS links: fix jquery version for bookworm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696)
[12:32:47] <jnuche>	 kostajh: ack, retrying
[12:33:40] <wikibugs>	 (03PS2) 10Volans: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696)
[12:34:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:45] <wikibugs>	 (03PS1) 10Effie Mouzeli: hieradata: migrate memcached gutter pool to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1166194 (https://phabricator.wikimedia.org/T398611)
[12:36:59] <wikibugs>	 (03PS3) 10Volans: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696)
[12:38:37] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#10972370 (10elukey) a:03elukey
[12:38:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:38:50] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#10972371 (10elukey) a:03herron
[12:39:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:13] <jnuche>	 kostajh: failed again, seems like the same failure
[12:46:34] <kostajh>	 jnuche: OK. Maybe ping Growth team as maintainers of CommunityConfiguration? Is there a task for the test failure already?
[12:46:44] <kostajh>	 cc urbanecm ^
[12:47:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[12:48:10] <wikibugs>	 (03PS4) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464)
[12:48:35] <jnuche>	 kostajh: I didn't find anything with a quick Phab search
[12:48:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:49:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[12:50:25] <jnuche>	 kostajh: I need to leave soon for a doctor's appointment, sorry about that. andre should be able to help with the backport once the problem with the tests is solved
[12:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[12:51:07] <logmsgbot>	 !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@09893e3]: bump section topics to v1.7.0
[12:53:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:54:12] <logmsgbot>	 !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@09893e3]: bump section topics to v1.7.0 (duration: 03m 20s)
[12:54:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:14] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye
[12:59:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to drbd
[13:00:04] <jouncebot>	 Urbanecm and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1300).
[13:00:04] <jouncebot>	 TheresNoTime and EggRoll97: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <EggRoll97>	 o/
[13:00:18] <TheresNoTime>	 o/
[13:01:28] <wikibugs>	 (03PS5) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464)
[13:03:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[13:03:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar)
[13:03:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137) (owner: 10EggRoll97)
[13:04:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:27] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar)
[13:04:29] <wikibugs>	 (03Merged) 10jenkins-bot: Allow abusefilter block action on plwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137) (owner: 10EggRoll97)
[13:04:46] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1166155|InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default (T377978)]], [[gerrit:1165635|Allow abusefilter block action on plwikiquote (T398137)]]
[13:04:51] <stashbot>	 T377978: [STORY] Template favouriting available on all foundation wikis - https://phabricator.wikimedia.org/T377978
[13:04:52] <stashbot>	 T398137: Allow blocking by abuse filter on Polish Wikiquote - https://phabricator.wikimedia.org/T398137
[13:05:15] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10972482 (10LSobanski) p:05Low→03Medium
[13:06:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[13:07:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1166194 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli)
[13:08:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:08:43] <logmsgbot>	 !log samtar@deploy1003 samtar, eggroll97: Backport for [[gerrit:1166155|InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default (T377978)]], [[gerrit:1165635|Allow abusefilter block action on plwikiquote (T398137)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:08:48] <TheresNoTime>	 EggRoll97: ready to test on mwdebug
[13:08:54] * TheresNoTime is also testing their patch
[13:10:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to drbd
[13:10:12] <icinga-wm>	 PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:10:28] <icinga-wm>	 RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 87.47 ms
[13:10:42] <EggRoll97>	 TheresNoTime: lgtm
[13:11:21] <logmsgbot>	 !log samtar@deploy1003 samtar, eggroll97: Continuing with sync
[13:13:08] <wikibugs>	 (03PS8) 10Arnaudb: gerrit: sanity checks as a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833)
[13:13:08] <wikibugs>	 (03CR) 10Arnaudb: "This cookbook is testable on any cumin host with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:13:23] <wikibugs>	 (03CR) 10Ssingh: "1. confd/confctl. Essentially, this is the template:" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:13:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:15:35] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: sanity checks cookbook implementation [cookbooks] - 10https://gerrit.wikimedia.org/r/1165880 (https://phabricator.wikimedia.org/T387833)
[13:15:35] <wikibugs>	 (03CR) 10Arnaudb: "Implementation of the topology-check cookbook, some readability improvements" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165880 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:16:20] <wikibugs>	 (03CR) 10Volans: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[13:17:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6001.wikimedia.org to drbd
[13:18:32] <sukhe>	 !log sudo cumin 'C:bird' "disable-puppet 'merging CR 1163858'": T374619
[13:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:35] <stashbot>	 T374619: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619
[13:18:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:18:51] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166155|InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default (T377978)]], [[gerrit:1165635|Allow abusefilter block action on plwikiquote (T398137)]] (duration: 14m 04s)
[13:18:54] <stashbot>	 T377978: [STORY] Template favouriting available on all foundation wikis - https://phabricator.wikimedia.org/T377978
[13:18:55] <stashbot>	 T398137: Allow blocking by abuse filter on Polish Wikiquote - https://phabricator.wikimedia.org/T398137
[13:18:56] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage
[13:19:03] <TheresNoTime>	 EggRoll97: done :)
[13:19:11] <EggRoll97>	 TheresNoTime: another backport done! tysm
[13:20:01] <TheresNoTime>	 !log done UTC afternoon backport window
[13:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972537 (10Jclark-ctr) @Stevemunene   This error was showing on console. looks like VD for os was cleared out at some...
[13:21:12] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:21:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972540 (10Jclark-ctr) a:05Stevemunene→03Jclark-ctr
[13:21:39] <sukhe>	 !log sudo cumin -b11 'C:bird' "run-puppet-agent --enable 'merging CR 1163858'": NOOP change T374619
[13:21:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972541 (10Jclark-ctr) Also updating bios and idrac firmware
[13:22:02] <icinga-wm>	 PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100%
[13:22:08] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage
[13:22:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto)
[13:22:29] <wikibugs>	 (03Abandoned) 10Stevemunene: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[13:22:51] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] CAS: Add wmf group for Zarcillo, remove ops [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto)
[13:23:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:24:30] <icinga-wm>	 RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[13:24:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:26:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6001.wikimedia.org to drbd
[13:26:42] <icinga-wm>	 PROBLEM - Host doh6001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:27:14] <icinga-wm>	 RECOVERY - Host doh6001 is UP: PING OK - Packet loss = 0%, RTA = 87.64 ms
[13:27:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200
[13:28:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6001.drmrs.wmnet to drbd
[13:28:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200
[13:28:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:28:55] <wikibugs>	 (03PS6) 10Volans: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[13:29:14] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:30:01] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:bird and C:bird::anycast: support exporting Prom metrics [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:30:19] <wikibugs>	 06SRE, 06DBA, 06serviceops, 05MW-1.44-notes, and 2 others: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10972588 (10Clement_Goubert)
[13:30:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:30:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[13:31:17] <wikibugs>	 (03PS7) 10Volans: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[13:31:47] <wikibugs>	 (03PS3) 10Ssingh: hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619)
[13:32:46] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6137/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:33:14] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:33:54] <icinga-wm>	 PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100%
[13:34:38] <claime>	 jouncebot: nowandnext
[13:34:38] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1300)
[13:34:38] <jouncebot>	 In 0 hour(s) and 55 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1430)
[13:34:57] <claime>	 I'm gonna reboot the docker-registry nodes
[13:35:35] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet
[13:35:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972622 (10Jclark-ctr) @cmooney i am available to assist
[13:36:40] <icinga-wm>	 RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms
[13:38:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6001.drmrs.wmnet to drbd
[13:39:13] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye
[13:39:50] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:40:05] <sukhe>	 ^ this is expected and unrelated to the other bird change being rolled out, which is a NOOP
[13:40:18] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet
[13:40:34] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2005.codfw.wmnet
[13:40:53] <moritzm>	 !log installing libxml2 security updates on bookworm
[13:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:48] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:42:12] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:42:14] <vgutierrez>	 registry2004 is expected?
[13:42:19] <sukhe>	 vgutierrez: yeah, reboot
[13:42:22] <vgutierrez>	 ack
[13:42:42] <claime>	 yeah no something's wrong with the restart 
[13:42:50] <claime>	 Jul 03 13:42:16 registry2004 docker-registry[2789]: configuration error: open /etc/docker/registry/config.yml: no such file or directory
[13:43:17] <claime>	 It's apparently not taken into account by the restart/repool cookbook's check before repool
[13:43:57] <jinxer-wm>	 FIRING: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:44:14] <claime>	 expected, on it
[13:44:15] <_joe_>	 yeah the docker registry is down
[13:44:17] <sukhe>	 !incidents
[13:44:18] <sirenbot>	 6453 (UNACKED)  ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw)
[13:44:18] <sirenbot>	 6452 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[13:44:18] <sirenbot>	 6451 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[13:44:18] <sirenbot>	 6450 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[13:44:19] <sirenbot>	 6448 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[13:44:19] <claime>	 yeah i know
[13:44:19] <sirenbot>	 6449 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[13:44:19] <sirenbot>	 6447 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[13:44:19] <sirenbot>	 6446 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[13:44:22] <sukhe>	 !ack 6453
[13:44:23] <sirenbot>	 6453 (ACKED)  ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw)
[13:44:40] <vgutierrez>	 slyngs: ^^ could you take that one? I'm in a meeting with Kwaku at the moment
[13:44:48] <slyngs>	 Sure
[13:44:58] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "Changed the patch to roll out only one DNS host to all. Since it's just exporting Prom metrics, I think it's fine." [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:45:00] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2005.codfw.wmnet
[13:45:10] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "I meant *DOH* host not *DNS* :)" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:46:22] <sukhe>	 !log sudo cumin 'A:wikidough' "disable-puppet 'merging CR 1163859'"
[13:46:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:34] <icinga-wm>	 PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:43] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet
[13:46:46] <_joe_>	 the docker-registry service is probably a leftover
[13:46:57] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:46:59] <_joe_>	 given we have multiple registry instances running 
[13:47:20] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet
[13:47:34] <claime>	 yeah there are two config files
[13:47:48] <claime>	 what's weird is why is it down
[13:47:54] <claime>	 there are two instances running
[13:48:08] <slyngs>	 It's the CODFW one
[13:48:19] <claime>	 I know
[13:48:21] <_joe_>	 claime: yeah I have no idea how it works, but on 1004 it's the same situation
[13:48:23] <slyngs>	 Okay :-)
[13:48:44] <claime>	 There are two registry instances running on the registry servers, one with apus backend and one with swift
[13:48:54] <claime>	 and the nginx config is supposed to serve them
[13:48:55] <_joe_>	 yes
[13:48:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:49:02] <icinga-wm>	 RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[13:49:15] <claime>	 so registry2005 rebooted correctly
[13:49:16] <_joe_>	 when restarting what I think happened is that the old service tried to take over a port or something
[13:49:30] <claime>	 ● docker-registry.service loaded failed failed the Docker toolset to pack, ship, store, and deliver content
[13:50:17] <_joe_>	 claime: yes that service needs to be manually eradicated I guess
[13:50:24] <claime>	 yeah
[13:50:26] <_joe_>	 it was removed but not absented
[13:50:46] <_joe_>	 in any case; the registry is back up?
[13:51:16] <claime>	 yeah, but I don't understand why it came back up correctly on 2005 and not on 2004
[13:51:16] <_joe_>	 yes it is
[13:51:28] <_joe_>	 it is serving traffic from 2004?
[13:51:44] <_joe_>	 it is
[13:51:53] <_joe_>	 so it came back correctly there too, eventually
[13:52:01] <_joe_>	 did you start anything manually?
[13:52:06] <claime>	 no
[13:52:21] <_joe_>	 so it is actually working
[13:52:31] <wikibugs>	 (03PS1) 10Ssingh: Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166203
[13:52:34] <_joe_>	 maybe takes some time to startup?
[13:52:50] <_joe_>	 I'm not even sure it properly failed
[13:53:21] <claime>	 it needed puppet to run from what I can see from the puppet logs
[13:53:41] <_joe_>	 what did the puppet run do? start the services?
[13:53:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:53:58] <claime>	 _joe_: yeah
[13:54:04] <claime>	 2025-07-03T13:39:18.050082+00:00 registry2004 puppet-agent[1248]: (/Stage[main]/Nginx/Service[nginx]/ensure) ensure changed 'stopped' to 'running' (corrective)
[13:54:12] <claime>	 2025-07-03T13:39:19.058234+00:00 registry2004 puppet-agent[1248]: (/Stage[main]/Docker_registry::Web/Jwt_authorizer::Service[docker-registry-ha-jwt]/Systemd::Service[docker-registry-ha-jwt]/Service[docker-registry-ha-jwt]/ensure) ensure changed 'stopped' to 'running' (corrective)
[13:54:27] <_joe_>	 wait wat
[13:54:44] <_joe_>	 so nginx doesn't start on reboot? lol
[13:54:51] <slyngs>	 :-)
[13:54:53] <claime>	 I'mma look at the puppet code
[13:55:10] <_joe_>	 I would look at the logs from the servers and the setup of systemd there
[13:55:23] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166203 (owner: 10Ssingh)
[13:55:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:52] <slyngs>	 We can also change the ensure => stopped on the registry to absent
[13:56:09] <_joe_>	 that's probably a good idea for that service
[13:56:10] <claime>	 yes, that's what I'm doing
[13:56:22] <claime>	 plus checking what the nginx systemd definition is doing
[13:56:23] <_joe_>	 but that's not why nginx failed to start, I'd assume
[13:56:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972687 (10cmooney) @Jclark-ctr has replaced the optics both side of the link.  Link is up and light levels healthy, we'll see how it goe...
[13:57:27] * vgutierrez back
[13:57:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972688 (10Jclark-ctr) Replaced both optics no spares on site now at eqiad
[13:57:36] <slyngs>	 No, it failed to start and then Puppet started it. I'm thinking permissions: bind() to unix:/var/run/nginx-auth/basic.sock failed
[13:58:06] <claime>	 yeah so
[13:58:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972689 (10Jclark-ctr) sr4 optics black handle  @RobH
[13:58:16] <wikibugs>	 (03PS1) 10Ssingh: hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619)
[13:58:22] <_joe_>	 it needs the ha-service up before nginx
[13:58:24] <claime>	 the nginx service cleans up the /var/run/nginx-auth/ dir as ExecStopPost
[13:58:28] <_joe_>	 so we need to declare the dependency
[13:58:28] <claime>	 This is created by puppet
[13:58:39] <claime>	 So it's not present on boot
[13:58:52] <claime>	 so when it tries to start it just can't until puppet runs
[13:58:53] <_joe_>	 that dir is created by puppet?
[13:58:55] <claime>	 yeah
[13:59:13] <_joe_>	 yeah it needs to be in tmpfiles I guess
[13:59:16] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6138/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:59:24] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet
[13:59:27] <claime>	 modules/docker_registry/manifests/web.pp L144 and following
[14:00:30] <_joe_>	 yeah so, either we create the dir before nginx starts in ExecStartPre
[14:00:40] <_joe_>	 which is probably a good idea
[14:00:47] <vgutierrez>	 what's the purpose of deleting it after service stops?
[14:00:56] <claime>	 https://trac.nginx.org/nginx/ticket/753
[14:01:03] <_joe_>	 or we do that before the auth service starts
[14:01:04] <claime>	 we're removing the socket only
[14:01:10] <claime>	 not the dir, my bad
[14:01:48] <vgutierrez>	 it doesn't matter
[14:01:50] <vgutierrez>	 it's on /var/run
[14:01:57] <_joe_>	 yep
[14:01:58] <vgutierrez>	 so that's a tmpfs that will be "empty" on system reboot
[14:02:02] <claime>	 yeah
[14:02:13] <vgutierrez>	 as _joe_ mentioned we need to move that to a tmpfiles.d config
[14:02:16] <claime>	 I'm for adding it to ExecStartPre
[14:02:16] <_joe_>	 and the dir needs to be created
[14:02:28] <_joe_>	 either/or
[14:02:30] <vgutierrez>	 and/or ExecStartPre
[14:02:38] <vgutierrez>	 as long as it doesn't fail if it's already there :D
[14:02:41] <_joe_>	 ExecStartPre is easier to understand
[14:02:43] <vgutierrez>	 mkdir -p should work
[14:02:49] <claime>	 We can keep the puppet definition for rights and stuff
[14:02:55] <slyngs>	 Add it as ExecStartPre and remove as a dir Puppet creates
[14:02:57] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] prometheus: add dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[14:03:10] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet
[14:03:26] <_joe_>	 claime: that puppet declaration makes no sense; it should be in tmpfiles.d
[14:03:36] <_joe_>	 and I think we even have a puppet define for it
[14:03:46] <vgutierrez>	 yep, we have tmpfiles.d in puppet
[14:04:05] <_joe_>	 systemd::tmpfile
[14:04:06] <sukhe>	 systemd::tmpfile
[14:04:17] <sukhe>	 we use it in a bunch of places
[14:04:17] <claime>	 jinx
[14:04:19] <vgutierrez>	 happy to patch it if needed
[14:04:34] <claime>	 vgutierrez: it's ok I'm on it
[14:04:40] <vgutierrez>	 cool
[14:05:08] <moritzm>	 archiva does this already, can be copied from there
[14:05:37] <moritzm>	 !log restarting clamav to pick up libxml security updates
[14:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:28] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] prometheus: add dnsbox_service_state_exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[14:08:30] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1006.eqiad.wmnet with reason: Maintenance and reboot
[14:09:23] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1007.eqiad.wmnet with reason: Maintenance and reboot
[14:09:39] <wikibugs>	 (03PS1) 10Muehlenhoff: crm: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/1166205 (https://phabricator.wikimedia.org/T135991)
[14:11:33] <wikibugs>	 (03PS1) 10C. Scott Ananian: skin: Omit "rendered with" phrase when the message is disabled [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616)
[14:12:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: 10C. Scott Ananian)
[14:12:50] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:16] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:58] <claime>	 Also I got tricked again, I started with codfw because I thought it was secondary
[14:14:07] <claime>	 forgot they were inverse pooled...
[14:14:20] <wikibugs>	 (03PS1) 10Muehlenhoff: k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209
[14:14:33] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200 (owner: 10Muehlenhoff)
[14:14:50] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:06] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200 (owner: 10Muehlenhoff)
[14:15:40] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:42] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:43] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10972817 (10ABran-WMF)
[14:17:18] <vgutierrez>	 !log depooling cp7006 for testing
[14:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:34] <wikibugs>	 (03PS29) 10Elukey: Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696)
[14:20:28] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[14:21:08] <wikibugs>	 (03PS1) 10Ssingh: P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619)
[14:21:18] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[14:22:16] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove puppetserver2003 from active Puppet servers [puppet] - 10https://gerrit.wikimedia.org/r/1166160 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[14:23:33] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[14:23:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worke...
[14:24:02] <wikibugs>	 (03CR) 10Elukey: "Left a nit but it looks good to me. Maybe for extra security, let's run PCC to confirm?" [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[14:24:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] skin: Omit "rendered with" phrase when the message is disabled [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: 10C. Scott Ananian)
[14:25:25] <wikibugs>	 (03PS4) 10Clément Goubert: docker_registry: Move nginx auth socket to tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1166208
[14:28:08] <wikibugs>	 (03PS2) 10Muehlenhoff: k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209
[14:30:04] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1430)
[14:30:44] <wikibugs>	 (03CR) 10Volans: "Nice! I finally found the time for a full pass, sorry for the delay. Consider pretty much all comments as optional except the reported bug" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[14:30:47] <wikibugs>	 (03CR) 10Muehlenhoff: "Sure: PCC at https://puppet-compiler.wmflabs.org/output/1166161/6135/" [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[14:31:20] <wikibugs>	 (03CR) 10Volans: [C:03+2] JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[14:31:34] <wikibugs>	 (03CR) 10Volans: [C:03+2] JS links: fix jquery version for bookworm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[14:31:37] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove puppetserver role frm puppetserver2003 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607)
[14:32:13] <wikibugs>	 (03Merged) 10jenkins-bot: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[14:32:18] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] docker_registry: Move nginx auth socket to tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1166208 (owner: 10Clément Goubert)
[14:32:26] <wikibugs>	 (03Merged) 10jenkins-bot: JS links: fix jquery version for bookworm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[14:32:36] <moritzm>	 !log installing bootstrap4 security updates
[14:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:27] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Haven't tested it but looks sane to me." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney)
[14:33:31] <wikibugs>	 (03PS3) 10Muehlenhoff: k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209
[14:34:03] <wikibugs>	 (03CR) 10Elukey: [C:03+1] k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209 (owner: 10Muehlenhoff)
[14:37:24] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10972892 (10ssingh) Happy to collaborate on this, FWIW.
[14:38:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Maintenance
[14:38:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10972894 (10MoritzMuehlenhoff)
[14:38:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T395241)', diff saved to https://phabricator.wikimedia.org/P78751 and previous config saved to /var/cache/conftool/dbconfig/20250703-143854-fceratto.json
[14:39:54] <wikibugs>	 (03PS2) 10Clément Goubert: docker_registry: Disable service [puppet] - 10https://gerrit.wikimedia.org/r/1166213
[14:40:15] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209 (owner: 10Muehlenhoff)
[14:40:49] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] docker_registry: Disable service [puppet] - 10https://gerrit.wikimedia.org/r/1166213 (owner: 10Clément Goubert)
[14:40:56] <wikibugs>	 (03CR) 10Volans: Add support for kubernetes (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[14:42:56] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Remove puppetserver2003 from serving requests [dns] - 10https://gerrit.wikimedia.org/r/1166159 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[14:43:17] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker-eqiad
[14:43:22] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Remove puppetserver2003 from active Puppet servers [puppet] - 10https://gerrit.wikimedia.org/r/1166160 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[14:44:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver2003 from serving requests [dns] - 10https://gerrit.wikimedia.org/r/1166159 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[14:44:10] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[14:44:33] <wikibugs>	 (03PS1) 10Ssingh: Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166215
[14:45:08] <wikibugs>	 (03CR) 10Ssingh: "Will resolve below and try again:" [puppet] - 10https://gerrit.wikimedia.org/r/1166215 (owner: 10Ssingh)
[14:45:14] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[14:45:22] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet
[14:45:44] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166215 (owner: 10Ssingh)
[14:45:56] <wikibugs>	 (03PS1) 10Volans: Upstream release v0.6.6 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166216
[14:46:05] <wikibugs>	 (03CR) 10Volans: [C:03+2] Upstream release v0.6.6 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166216 (owner: 10Volans)
[14:46:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T395241)', diff saved to https://phabricator.wikimedia.org/P78752 and previous config saved to /var/cache/conftool/dbconfig/20250703-144619-fceratto.json
[14:47:00] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v0.6.6 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166216 (owner: 10Volans)
[14:48:15] <vgutierrez>	 !log repooling cp7006
[14:48:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:56] <wikibugs>	 (03CR) 10Volans: "Did a full pass, LGTM, beside the previous possible bug" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[14:49:49] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet
[14:50:14] <volans>	 !log uploaded debmonitor-server,python3-debmonitor_0.6.6 to apt.wikimedia.org bookworm-wikimedia
[14:50:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:36] <wikibugs>	 (03PS30) 10Elukey: Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696)
[14:50:43] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1006.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002
[14:50:47] <wikibugs>	 (03CR) 10Elukey: Add support for kubernetes (032 comments) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[14:50:53] <wikibugs>	 (03PS4) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[14:51:16] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1005.eqiad.wmnet
[14:53:02] <wikibugs>	 (03PS17) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316
[14:55:05] <wikibugs>	 (03CR) 10JHathaway: dhcp: add a UUID based DHCP config (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway)
[14:55:49] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[14:55:54] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1005.eqiad.wmnet
[14:56:56] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1007.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002
[15:00:05] <jouncebot>	 jnuche and jeena: That opportune time for a Train log triage deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1500).
[15:01:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P78753 and previous config saved to /var/cache/conftool/dbconfig/20250703-150126-fceratto.json
[15:02:06] <wikibugs>	 (03PS31) 10Elukey: Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696)
[15:02:16] <wikibugs>	 (03CR) 10Elukey: Add support for kubernetes (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[15:04:37] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker-eqiad
[15:04:37] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove puppetserver role frm puppetserver2003 for decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[15:04:50] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker-codfw
[15:06:00] <wikibugs>	 (03PS7) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317
[15:06:31] <wikibugs>	 (03PS5) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:38] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[15:07:44] <wikibugs>	 (03CR) 10JHathaway: reimage: add support for using the host UUID for DHCP (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[15:09:10] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[15:10:42] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs5006.eqsin.wmnet with reason: katran migration
[15:10:52] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs5006 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[15:14:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[15:14:56] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] hdfs: set an-worker1176 and 1179 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1166170 (https://phabricator.wikimedia.org/T398027) (owner: 10Stevemunene)
[15:16:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P78754 and previous config saved to /var/cache/conftool/dbconfig/20250703-151633-fceratto.json
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:21:44] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs5006.eqsin.wmnet
[15:21:44] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs5006.eqsin.wmnet
[15:22:29] <vgutierrez>	 !log lvs5006 migrated to katran - T396561
[15:22:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:32] <stashbot>	 T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561
[15:23:32] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:24:41] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, thanks!" [software/spicerack] - https://gerrit.wikimedia.org/r/1164316 (owner: JHathaway)
[15:25:46] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker-codfw
[15:28:29] <wikibugs>	 (PS1) Hnowlan: api-gateway: use latest build of ratelimit service in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804)
[15:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[15:29:17] <wikibugs>	 (PS1) Ssingh: bird/anycast-hc: allow setting SupplementaryGroups for anycast-hc unit [puppet] - https://gerrit.wikimedia.org/r/1166222 (https://phabricator.wikimedia.org/T374619)
[15:29:19] <wikibugs>	 (PS1) Ssingh: hiera: dnsbox: set supplementary_groups for anycast-hc [puppet] - https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619)
[15:30:49] <wikibugs>	 (PS2) Ssingh: hiera: dnsbox: set supplementary_groups for anycast-hc [puppet] - https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619)
[15:31:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T395241)', diff saved to https://phabricator.wikimedia.org/P78755 and previous config saved to /var/cache/conftool/dbconfig/20250703-153141-fceratto.json
[15:31:53] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6142/co" [puppet] - https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[15:32:48] <wikibugs>	 (CR) Clément Goubert: [C:+1] api-gateway: use latest build of ratelimit service in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804) (owner: Hnowlan)
[15:33:16] <vgutierrez>	 !log depooling cp7006 for testing
[15:33:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:28] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: testing
[15:38:48] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye
[15:38:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973083 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker117...
[15:39:45] <wikibugs>	 (PS1) Ssingh: C:prometheus: dnsbox_service_state_exporter s/define/class [puppet] - https://gerrit.wikimedia.org/r/1166224 (https://phabricator.wikimedia.org/T374619)
[15:40:12] <wikibugs>	 (CR) JHathaway: [C:+2] dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316 (owner: JHathaway)
[15:41:38] <wikibugs>	 (CR) Hnowlan: [C:+2] api-gateway: use latest build of ratelimit service in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804) (owner: Hnowlan)
[15:42:16] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[15:42:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973107 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worke...
[15:43:19] <wikibugs>	 (Merged) jenkins-bot: api-gateway: use latest build of ratelimit service in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804) (owner: Hnowlan)
[15:46:26] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[15:46:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[15:47:18] <wikibugs>	 (PS1) Ssingh: team-traffic: add dnsbox alert for service status mistmatch [alerts] - https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619)
[15:48:03] <wikibugs>	 (PS8) JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - https://gerrit.wikimedia.org/r/1164317
[15:48:11] <wikibugs>	 (PS2) Ssingh: team-traffic: add dnsbox alert for service status mismatch [alerts] - https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619)
[15:48:15] <wikibugs>	 (CR) JHathaway: reimage: add support for using the host UUID for DHCP (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164317 (owner: JHathaway)
[15:48:41] <wikibugs>	 (PS6) Abijeet Patro: CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513)
[15:49:24] <wikibugs>	 (CR) CI reject: [V:-1] team-traffic: add dnsbox alert for service status mismatch [alerts] - https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[15:52:02] <wikibugs>	 (PS3) Ssingh: team-traffic: add dnsbox alert for service status mismatch [alerts] - https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619)
[15:52:40] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[15:52:54] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[15:54:20] <wikibugs>	 (CR) CI reject: [V:-1] reimage: add support for using the host UUID for DHCP [cookbooks] - https://gerrit.wikimedia.org/r/1164317 (owner: JHathaway)
[15:57:14] <wikibugs>	 (CR) Ssingh: "This is not really ready for review as I need to verify the metrics and the expression. But it's a good template for later so leaving it h" [alerts] - https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[15:57:32] <wikibugs>	 (CR) Volans: [C:+1] "To be tested with a new spicerack release for the dhcp changes but LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1164317 (owner: JHathaway)
[16:00:05] <jouncebot>	 jhathaway and moritzm: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1600). nyaa~
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:08] <MichaelG_WMF>	 jouncebot: nowandnext
[16:02:09] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1600)
[16:02:09] <jouncebot>	 In 0 hour(s) and 57 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700)
[16:02:09] <jouncebot>	 In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700)
[16:03:07] <wikibugs>	 (PS1) Michael Große: tests: skip test to allow updating CommunityConfigurationExample [extensions/CommunityConfiguration] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166226
[16:03:27] <wikibugs>	 (CR) Stevemunene: [C:+2] hdfs: set an-worker1176 and 1179 to insetup role [puppet] - https://gerrit.wikimedia.org/r/1166170 (https://phabricator.wikimedia.org/T398027) (owner: Stevemunene)
[16:05:54] <wikibugs>	 (PS2) Michael Große: tests: skip test to allow updating CommunityConfigurationExample [extensions/CommunityConfiguration] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166226 (https://phabricator.wikimedia.org/T398624)
[16:09:21] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7006.magru.wmnet
[16:09:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7006.magru.wmnet
[16:11:44] <vgutierrez>	 !log repooling cp7006
[16:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:07] <wikibugs>	 (PS1) Federico Ceratto: zarcillo: Update egress to idp.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810)
[16:12:07] <wikibugs>	 (CR) Federico Ceratto: "A small update to the egress conf, already tested" [deployment-charts] - https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[16:13:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390064#10973219 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[16:14:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10973221 (Jclark-ctr) a:Jclark-ctr
[16:15:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10973222 (Jclark-ctr) Open→Resolved
[16:16:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983#10973223 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[16:17:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397852#10973225 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[16:25:18] <wikibugs>	 (PS1) Hnowlan: Revert "api-gateway: use latest build of ratelimit service in prod" [deployment-charts] - https://gerrit.wikimedia.org/r/1166229
[16:27:48] <wikibugs>	 (CR) Hnowlan: [C:+2] Revert "api-gateway: use latest build of ratelimit service in prod" [deployment-charts] - https://gerrit.wikimedia.org/r/1166229 (owner: Hnowlan)
[16:29:25] <wikibugs>	 (Merged) jenkins-bot: Revert "api-gateway: use latest build of ratelimit service in prod" [deployment-charts] - https://gerrit.wikimedia.org/r/1166229 (owner: Hnowlan)
[16:31:50] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[16:32:13] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[16:32:14] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[16:32:27] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[16:35:27] <wikibugs>	 (PS6) Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[16:44:20] <wikibugs>	 (PS7) Vgutierrez: cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[16:47:04] <wikibugs>	 (CR) Abijeet Patro: CX: Add virtual-cx-shared DatabaseVirtualDomains (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[16:47:21] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[16:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[16:52:38] <wikibugs>	 (PS8) Vgutierrez: cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[17:00:05] <jouncebot>	 bd808: #bothumor I � Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700)
[17:01:04] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[17:01:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10973309 (dcaro) >>! In T394333#10964951, @Andrew wrote: >>>! In T394333#10964303, @ayounsi wrote: >> @Andrew Would it be possible to use a single 25G up...
[17:01:11] <wikibugs>	 (PS9) Vgutierrez: cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561)
[17:01:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973310 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-...
[17:03:35] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[17:07:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650 (aranyap) NEW
[17:08:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10973332 (aranyap)
[17:09:05] <wikibugs>	 (CR) Cathal Mooney: [C:+2] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: Cathal Mooney)
[17:10:59] <wikibugs>	 (Merged) jenkins-bot: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: Cathal Mooney)
[17:12:24] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[17:12:47] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[17:12:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:13:01] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[17:13:09] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[17:13:37] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[17:15:27] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1176.eqiad.wmnet with reason: host reimage
[17:17:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10973370 (SLopes-WMF) As @aranyap's manager, I approve this request.
[17:18:47] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1176.eqiad.wmnet with reason: host reimage
[17:20:41] <wikibugs>	 (PS1) Volans: UI: improve table grouping by column [software/debmonitor] - https://gerrit.wikimedia.org/r/1166233 (https://phabricator.wikimedia.org/T397696)
[17:22:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:23:31] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10973378 (cmooney) Open→Resolved a:cmooney
[17:24:38] <wikibugs>	 (CR) Volans: "Let me know what do you think" [software/debmonitor] - https://gerrit.wikimedia.org/r/1166233 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[17:24:40] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@9088e59]: Synchronize artifacat for airflow_dags/analytics_test
[17:24:47] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10973383 (cmooney) FWIW I seen an interesting talk from the latest Nanog conference about "return loss" on shorter and faster links which can c...
[17:24:55] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@9088e59]: Synchronize artifacat for airflow_dags/analytics_test (duration: 00m 15s)
[17:25:41] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics@9088e59]: Synchronize artifacts for airflow_dags/analytics
[17:26:22] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics@9088e59]: Synchronize artifacts for airflow_dags/analytics (duration: 00m 40s)
[17:33:11] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1176.eqiad.wmnet with OS bullseye
[17:33:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973400 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-work...
[17:43:35] <wikibugs>	 (CR) A smart kitten: "The CI failure is T398624 (& maybe following 48b91a15 it won't happen any more?)" [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: C. Scott Ananian)
[17:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:59:08] <wikibugs>	 (CR) C. Scott Ananian: "recheck" [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: C. Scott Ananian)
[17:59:40] <wikibugs>	 (PS1) Bartosz Dziewoński: Use FallbackContentHandler for undeployed JsonConfig content handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1166236 (https://phabricator.wikimedia.org/T124748)
[18:00:04] <jouncebot>	 jnuche and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1800).
[18:01:05] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166236 (https://phabricator.wikimedia.org/T124748) (owner: Bartosz Dziewoński)
[18:09:46] <wikibugs>	 (PS1) Ssingh: C:bird::anycast_healthchecker: notify service on conf file change [puppet] - https://gerrit.wikimedia.org/r/1166238
[18:10:48] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6143/co" [puppet] - https://gerrit.wikimedia.org/r/1166238 (owner: Ssingh)
[19:09:28] <wikibugs>	 SRE-swift-storage, Thumbor: Inline image is displayed incorrectly - https://phabricator.wikimedia.org/T398660 (matej_suchanek) NEW
[19:12:03] <wikibugs>	 (PS2) Arlolra: ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL [mediawiki-config] - https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798)
[19:12:16] <wikibugs>	 (CR) Arlolra: ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798) (owner: Arlolra)
[19:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[19:29:04] <wikibugs>	 (CR) Atieno: [C:+1] ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL [mediawiki-config] - https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798) (owner: Arlolra)
[19:31:20] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798) (owner: Arlolra)
[19:34:41] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@7ba4a7b]: BUGFIX - Synchronize artifact for airflow_dags/analytics_test
[19:34:57] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@7ba4a7b]: BUGFIX - Synchronize artifact for airflow_dags/analytics_test (duration: 00m 16s)
[19:35:45] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics@7ba4a7b]: BUGFIX - Synchronize artifact for airflow_dags/analytics
[19:36:47] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics@7ba4a7b]: BUGFIX - Synchronize artifact for airflow_dags/analytics (duration: 01m 02s)
[19:53:13] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166133 (https://phabricator.wikimedia.org/T385890) (owner: Zabe)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T2000). Please do the needful.
[20:00:05] <jouncebot>	 cscott, MatmaRex, arlolra, and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:40] <zabe>	 o/
[20:01:42] <wikibugs>	 (CR) Zabe: [C:+2] Use correct index on categorylinks [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166133 (https://phabricator.wikimedia.org/T385890) (owner: Zabe)
[20:01:59] <zabe>	 since no one else appears to already be here, I will start with mine
[20:02:22] <arlolra>	 here
[20:02:42] <wikibugs>	 (Merged) jenkins-bot: Use correct index on categorylinks [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166133 (https://phabricator.wikimedia.org/T385890) (owner: Zabe)
[20:03:26] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1166133|Use correct index on categorylinks (T385890)]]
[20:03:29] <stashbot>	 T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[20:05:23] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1166133|Use correct index on categorylinks (T385890)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:06:09] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[20:08:55] <cscott>	 woo i lost track of time but i'm here now
[20:11:58] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166133|Use correct index on categorylinks (T385890)]] (duration: 08m 32s)
[20:12:02] <stashbot>	 T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[20:12:12] <zabe>	 Ok I am done
[20:12:37] <cscott>	 i'll jump in next?
[20:12:45] <cscott>	 i'm spiderpigging it
[20:12:50] <kostajh>	 if anyone would like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1166178, that would be nice
[20:13:20] <kostajh>	 I don't know if a train conductor is around. It would end the logspam associated with T398167
[20:13:20] <stashbot>	 T398167: MediaWiki\Exception\UserNotLoggedIn: Please log in to be able to access this page or action. - https://phabricator.wikimedia.org/T398167
[20:13:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: C. Scott Ananian)
[20:13:46] <zabe>	 it fails CI, doesn't it?
[20:13:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:14:01] <kostajh>	 it should be fixed now, as a patch to CommunityConfiguration was backported
[20:14:09] <wikibugs>	 (CR) Kosta Harlan: "recheck" [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: Kosta Harlan)
[20:14:14] <wikibugs>	 (PS2) Kosta Harlan: special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167)
[20:14:52] <MatmaRex>	 oops, i am late for my deployment
[20:15:14] <MatmaRex>	 anyone still around who could send out that little patch?
[20:15:53] <zabe>	 sure, cscott is currently doing theirs
[20:27:18] <wikibugs>	 (Merged) jenkins-bot: skin: Omit "rendered with" phrase when the message is disabled [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: C. Scott Ananian)
[20:27:32] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1166206|skin: Omit "rendered with" phrase when the message is disabled (T398616)]]
[20:27:35] <stashbot>	 T398616: New "renderedwith-legacy" string shows additional "-" (dash) at bottoms of all pages - https://phabricator.wikimedia.org/T398616
[20:29:32] <logmsgbot>	 !log cscott@deploy1003 cscott: Backport for [[gerrit:1166206|skin: Omit "rendered with" phrase when the message is disabled (T398616)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:30:31] <logmsgbot>	 !log cscott@deploy1003 cscott: Continuing with sync
[20:30:49] <cscott>	 my patch looks good on test servers.  i'm saving the world from a random - in the page footer
[20:31:30] <cscott>	 thanks mostly to MatmaRex 
[20:34:13] <MatmaRex>	 ;) i would have just let it ride. i'm surprised anyone noticed
[20:34:20] <MatmaRex>	 are deploys extra slow today?
[20:34:59] <MatmaRex>	 maybe we can ship both config changes together
[20:36:03] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166206|skin: Omit "rendered with" phrase when the message is disabled (T398616)]] (duration: 08m 30s)
[20:36:06] <stashbot>	 T398616: New "renderedwith-legacy" string shows additional "-" (dash) at bottoms of all pages - https://phabricator.wikimedia.org/T398616
[20:36:15] <arlolra>	 MatmaRex: I can do that, sure
[20:36:20] <cscott>	 ok i'm done, who's next
[20:36:54] <arlolra>	 I'm going to deploy the two config changes
[20:37:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166236 (https://phabricator.wikimedia.org/T124748) (owner: Bartosz Dziewoński)
[20:37:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798) (owner: Arlolra)
[20:38:28] <wikibugs>	 (Merged) jenkins-bot: Use FallbackContentHandler for undeployed JsonConfig content handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1166236 (https://phabricator.wikimedia.org/T124748) (owner: Bartosz Dziewoński)
[20:38:30] <wikibugs>	 (Merged) jenkins-bot: ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL [mediawiki-config] - https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798) (owner: Arlolra)
[20:38:45] <logmsgbot>	 !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1166236|Use FallbackContentHandler for undeployed JsonConfig content handlers (T124748)]], [[gerrit:1166012|ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL (T390798 T389313)]]
[20:38:50] <stashbot>	 T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748
[20:38:50] <stashbot>	 T390798: Mark REL1_44 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T390798
[20:38:51] <stashbot>	 T389313: Formally EOL MW 1.42 - https://phabricator.wikimedia.org/T389313
[20:39:32] <wikibugs>	 (PS1) BryanDavis: gitlab: Allow WMCS runners to talk to deployment-prep wikis [puppet] - https://gerrit.wikimedia.org/r/1166262 (https://phabricator.wikimedia.org/T397591)
[20:39:34] <wikibugs>	 (PS1) BryanDavis: gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936)
[20:40:17] <wikibugs>	 (CR) CI reject: [V:-1] gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) (owner: BryanDavis)
[20:40:42] <logmsgbot>	 !log arlolra@deploy1003 arlolra, matmarex: Backport for [[gerrit:1166236|Use FallbackContentHandler for undeployed JsonConfig content handlers (T124748)]], [[gerrit:1166012|ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL (T390798 T389313)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:41:26] <MatmaRex>	 my change looks good
[20:41:32] <logmsgbot>	 !log arlolra@deploy1003 arlolra, matmarex: Continuing with sync
[20:41:33] <MatmaRex>	 (tested at https://www.mediawiki.org/w/index.php?title=Data:Json:Wikicon&oldid=1435870&unhide=1 , no longer throws exceptions)
[20:41:52] <wikibugs>	 (PS2) BryanDavis: gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936)
[20:42:00] <arlolra>	 thanks
[20:46:55] <MatmaRex>	 thank you
[20:47:12] <logmsgbot>	 !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166236|Use FallbackContentHandler for undeployed JsonConfig content handlers (T124748)]], [[gerrit:1166012|ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL (T390798 T389313)]] (duration: 08m 27s)
[20:47:17] <stashbot>	 T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748
[20:47:17] <stashbot>	 T390798: Mark REL1_44 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T390798
[20:47:17] <stashbot>	 T389313: Formally EOL MW 1.42 - https://phabricator.wikimedia.org/T389313
[20:50:04] <arlolra>	 kostajh: did you want me to deploy 1166178?
[20:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[20:51:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org
[20:53:16] <wikibugs>	 (PS1) Zabe: Set categorylinks to read new on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1166264 (https://phabricator.wikimedia.org/T397912)
[20:54:49] <zabe>	 I can do kostajh's together with mine
[20:54:56] <arlolra>	 ok
[20:55:03] <wikibugs>	 (CR) Zabe: [V:+2] special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: Kosta Harlan)
[20:55:16] <wikibugs>	 (CR) Zabe: [C:+2] special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: Kosta Harlan)
[20:55:28] <zabe>	 first was misclicked
[20:55:33] <wikibugs>	 (CR) Zabe: [C:+2] Set categorylinks to read new on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1166264 (https://phabricator.wikimedia.org/T397912) (owner: Zabe)
[20:55:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast6003.wikimedia.org to drbd
[20:56:34] <wikibugs>	 (Merged) jenkins-bot: Set categorylinks to read new on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1166264 (https://phabricator.wikimedia.org/T397912) (owner: Zabe)
[20:58:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T2100)
[21:01:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by zabe@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: Kosta Harlan)
[21:07:44] <wikibugs>	 (Merged) jenkins-bot: special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: Kosta Harlan)
[21:08:02] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1166178|special: Do not throw ErrorPageError from getRedirect() (T398167)]], [[gerrit:1166264|Set categorylinks to read new on small wikis (T397912)]]
[21:08:06] <stashbot>	 T398167: MediaWiki\Exception\UserNotLoggedIn: Please log in to be able to access this page or action. - https://phabricator.wikimedia.org/T398167
[21:08:06] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[21:09:54] <logmsgbot>	 !log zabe@deploy1003 kharlan, zabe: Backport for [[gerrit:1166178|special: Do not throw ErrorPageError from getRedirect() (T398167)]], [[gerrit:1166264|Set categorylinks to read new on small wikis (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:11:12] <logmsgbot>	 !log zabe@deploy1003 kharlan, zabe: Continuing with sync
[21:16:39] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166178|special: Do not throw ErrorPageError from getRedirect() (T398167)]], [[gerrit:1166264|Set categorylinks to read new on small wikis (T397912)]] (duration: 08m 37s)
[21:16:43] <stashbot>	 T398167: MediaWiki\Exception\UserNotLoggedIn: Please log in to be able to access this page or action. - https://phabricator.wikimedia.org/T398167
[21:16:44] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[21:16:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast6003.wikimedia.org to drbd
[21:17:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet
[21:18:06] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10973729 (ops-monitoring-bot) Draining ganeti6001.drmrs.wmnet of running VMs
[21:18:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet
[21:19:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast6003.wikimedia.org to plain
[21:21:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast6003.wikimedia.org to plain
[21:23:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:29:35] <kostajh>	 zabe: thanks for syncing it
[21:29:44] <zabe>	 yw
[21:34:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:45:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973741 (VRiley-WMF)
[21:45:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10973742 (VRiley-WMF) Added and updated location of the units in netbox.
[21:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:05:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:00:16] <wikibugs>	 (PS1) Bartosz Dziewoński: Use FallbackContentHandler for another undeployed content handler [mediawiki-config] - https://gerrit.wikimedia.org/r/1166286 (https://phabricator.wikimedia.org/T124748)
[23:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[23:38:37] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166288
[23:38:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166288 (owner: TrainBranchBot)
[23:46:27] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83708MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[23:53:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166288 (owner: TrainBranchBot)
[23:55:41] <wikibugs>	 (PS2) BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)