[00:03:34] !log reprepro include php-apcu_5.1.24-1+wmf11u1 in component/php83 - T398245
[00:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:37] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[00:08:37] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166049
[00:08:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166049 (owner: TrainBranchBot)
[00:08:41] !log reprepro include php-igbinary_3.2.16-4+wmf11u1 in component/php83 - T398245
[00:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:44] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[00:09:40] !log reprepro include php-msgpack_3.0.0-1+wmf11u1 in component/php83 - T398245
[00:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:36] (PS1) DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1166051
[00:21:13] (CR) DDesouza: [C:+2] miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1166051 (owner: DDesouza)
[00:22:42] (CR) Dzahn: [C:+2] add initial blubber .pipeline config and a README [container/codesearch] - https://gerrit.wikimedia.org/r/1166044 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:23:11] (Merged) jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1166051 (owner: DDesouza)
[00:25:26] (CR) Dzahn: [V:+2 C:+2] add initial blubber .pipeline config and a README [container/codesearch] - https://gerrit.wikimedia.org/r/1166044 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:29:01] (PS1) Dzahn: add blubber skeleton config, use base image
nodejs [container/codesearch] - https://gerrit.wikimedia.org/r/1166052 (https://phabricator.wikimedia.org/T268199)
[00:32:01] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166049 (owner: TrainBranchBot)
[00:35:39] (CR) Dzahn: [V:+2 C:+2] add blubber skeleton config, use base image nodejs [container/codesearch] - https://gerrit.wikimedia.org/r/1166052 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:36:28] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:37:21] SRE, Data-Platform-SRE: Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523#10970526 (Scott_French) Open→Resolved a:RKemper It has been over 1h since https://gerrit.wikimedia.org/r/1166016 was merged, and subsequent puppet runs on the prometh...
[00:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[01:02:25] FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:27:25] RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:38] (PS1) DDesouza: miscweb(design-strategy): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166059 (https://phabricator.wikimedia.org/T344471)
[01:49:41] (PS1) DDesouza: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166060 (https://phabricator.wikimedia.org/T219903)
[01:51:35] (PS1) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[01:51:45] (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166060 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[01:51:46] (CR) DDesouza:
[C:+2] miscweb(design-strategy): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166059 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[01:53:47] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:53:53] (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[01:53:56] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:53:58] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:54:10] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:54:11] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:54:17] (Merged) jenkins-bot: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166060 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[01:54:18] (Merged) jenkins-bot: miscweb(design-strategy): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1166059 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[01:54:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:54:25] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:54:48] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:54:50] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:54:51] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:54:53] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:54:54] !log
dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:54:56] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:55:21] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:55:34] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:55:35] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:55:47] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:55:48] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:56:05] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:56:10] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:56:22] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:56:23] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:56:39] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:56:40] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:56:58] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:57:14] FIRING: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:58:10] (PS2) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[02:00:27] (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[02:03:07] (PS3) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] -
https://gerrit.wikimedia.org/r/1166061
[02:05:27] (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[02:06:19] (PS4) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[02:08:26] (CR) CI reject: [V:-1] maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[02:15:16] (PS5) Andrew Bogott: maintain_dbusers: when encountering an invalid uid, log and continue [puppet] - https://gerrit.wikimedia.org/r/1166061
[02:19:26] (CR) Andrew Bogott: "Raymond -- I'm not 100% convinced that this is better, it makes this more resilient for the particular dumb issue that I caused but may no" [puppet] - https://gerrit.wikimedia.org/r/1166061 (owner: Andrew Bogott)
[03:06:53] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[03:18:42] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[03:22:24] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[03:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[03:38:22] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[04:29:29] SRE, Data-Engineering: Include accept-language header in turnilo/superset - https://phabricator.wikimedia.org/T398213#10970650 (Joe) Open→Resolved
Thanks @JAllemandou @BTullis for the assistance!
[04:39:28] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[04:42:28] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.008 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[04:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:21:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:22:51] (PS1) MusikAnimal: codeFolding: fix folding [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430)
[05:23:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430) (owner: MusikAnimal)
[05:50:04] (PS5) Abijeet Patro: CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] -
https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513)
[05:56:52] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:57:15] FIRING: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600)
[06:00:05] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600).
[06:24:30] (CR) Slyngshede: [C:+1] "LGTM. I am starting to think that maybe we should make a Cmnd_Alias for SpiderPig, but that's not supported in our Puppet code at the mome" [puppet] - https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[06:31:20] (CR) Elukey: [C:+1] kubernetes: improve naming [software/debmonitor] - https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:31:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:34:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to drbd
[06:36:03] SRE, Editing-team, Fundraising-Backlog, Traffic-Icebox, and 5 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085#10970742 (Wellverywell) On
what exactly is this task stalled? AFAICS it was planned to be done in early 2020?
[06:38:00] (PS1) Elukey: pyrra: fix success ration config for Istio SLOs [puppet] - https://gerrit.wikimedia.org/r/1166068
[06:40:10] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:41:23] (CR) Elukey: [C:+2] pyrra: fix success ration config for Istio SLOs [puppet] - https://gerrit.wikimedia.org/r/1166068 (owner: Elukey)
[06:41:45] (PS2) Dzahn: microsites: refactor blackbox checks to use resource defaults [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[06:43:03] (CR) Dzahn: "Hey, thank you for this! sorry for the delay. I was out of office and just saw this today. (maybe also because it was marked as WIP) and n" [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[06:45:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to drbd
[06:45:32] PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:40] RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 87.57 ms
[06:47:03] (PS1) Elukey: pyrra: fix success-ratio's default regex [puppet] - https://gerrit.wikimedia.org/r/1166070
[06:47:32] PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:47:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to drbd
[06:49:10] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:49:32] RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:49:34] (CR) Elukey: [C:+2] pyrra: fix success-ratio's default regex [puppet] - https://gerrit.wikimedia.org/r/1166070 (owner: Elukey)
[06:49:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet
[06:49:53] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10970754 (ops-monitoring-bot) Draining ganeti6003.drmrs.wmnet of running VMs
[06:50:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet
[06:50:41] jmm@cumin2002 changedisk (PID 158663) is awaiting input
[06:56:10] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:56:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:10] (CR) Volans: [C:+2] kubernetes: improve naming [software/debmonitor] - https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:57:15] (PS1) Elukey: pyrra: fix-2 for the slo success-ratio's regex [puppet] - https://gerrit.wikimedia.org/r/1166071
[06:57:36] (CR) Volans: [C:+2] postinst: clear stale files [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:58:00] (Merged) jenkins-bot: kubernetes: improve naming [software/debmonitor] - https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[06:58:21] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6128/co" [puppet] - https://gerrit.wikimedia.org/r/1166071 (owner: Elukey)
[06:58:30] (PS1) Jelto: microsites::monitoring: update body_regex_matches [puppet] - https://gerrit.wikimedia.org/r/1166073 (https://phabricator.wikimedia.org/T398528)
[06:58:32] (Merged) jenkins-bot: postinst: clear stale files [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[07:00:04] (CR) Elukey: [V:+1 C:+2] pyrra: fix-2 for the slo success-ratio's regex [puppet] - https://gerrit.wikimedia.org/r/1166071 (owner: Elukey)
[07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0700).
[07:00:05] musikanimal: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:42] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:59] o/
[07:01:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to drbd
[07:01:46] PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:01:50] RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 87.47 ms
[07:02:59] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on prometheus6002.drmrs.wmnet with reason: switch disk type back to DRBD
[07:03:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to drbd
[07:03:48] PROBLEM - Bird Internet Routing Daemon on doh6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[07:04:48] RECOVERY - Bird Internet Routing Daemon on doh6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[07:05:01] I'm hoping someone is on call today.
My patch is small, but somewhat critical to get deployed before wmf.8 lands on group2 tomorrow
[07:05:10] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:05:42] RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:07:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Lumen (2001:1900:2100::a99) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:08:10] musikanimal: can you self serve?
[07:09:34] I am not trained to do deploys, no
[07:10:06] (CR) Ladsgroup: [C:+2] codeFolding: fix folding [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430) (owner: MusikAnimal)
[07:11:14] (Merged) jenkins-bot: codeFolding: fix folding [extensions/CodeMirror] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166067 (https://phabricator.wikimedia.org/T398430) (owner: MusikAnimal)
[07:12:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/4 (Transit: Lumen (442550278) {#1503}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:14:04] SRE-SLO: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534 (elukey) NEW p:Triage→High
[07:14:23] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1166067|codeFolding: fix folding
(T398430)]]
[07:14:26] T398430: Cursor does not update upon move when nesting unclosed tags in CM6 - https://phabricator.wikimedia.org/T398430
[07:15:27] FIRING: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:15:39] (CR) Jelto: [C:+2] microsites::monitoring: update body_regex_matches [puppet] - https://gerrit.wikimedia.org/r/1166073 (https://phabricator.wikimedia.org/T398528) (owner: Jelto)
[07:16:44] !log ladsgroup@deploy1003 musikanimal, ladsgroup: Backport for [[gerrit:1166067|codeFolding: fix folding (T398430)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:16:53] (PS1) Muehlenhoff: Unvendor jquery [software/debmonitor] - https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696)
[07:16:54] ^
[07:16:57] please verify
[07:17:08] doing!
[07:17:23] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2318.mgmt:22 - https://phabricator.wikimedia.org/T398536 (phaultfinder) NEW
[07:17:35] ops-codfw, DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537 (FCeratto-WMF) NEW
[07:18:28] !log depooling cp7006 for requestctl debugging
[07:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:29] ops-codfw, DC-Ops: Unresponsive management for thanos-fe2007.mgmt:22 - https://phabricator.wikimedia.org/T398538 (phaultfinder) NEW
[07:18:30] ops-codfw, DC-Ops: Unresponsive management for db2152.mgmt:22 - https://phabricator.wikimedia.org/T398539 (phaultfinder) NEW
[07:18:31] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2200.mgmt:22 - https://phabricator.wikimedia.org/T398540 (phaultfinder) NEW
[07:18:32] ops-codfw, DC-Ops: Unresponsive management for cirrussearch2115.mgmt:22 - https://phabricator.wikimedia.org/T398541
(phaultfinder) NEW
[07:18:33] ops-codfw, DC-Ops: Unresponsive management for arclamp2001.mgmt:22 - https://phabricator.wikimedia.org/T398543 (phaultfinder) NEW
[07:18:34] ops-codfw, DC-Ops: Unresponsive management for gerrit2002.mgmt:22 - https://phabricator.wikimedia.org/T398542 (phaultfinder) NEW
[07:18:37] ops-codfw, DC-Ops: Unresponsive management for conf2006.mgmt:22 - https://phabricator.wikimedia.org/T398546 (phaultfinder) NEW
[07:18:41] ops-codfw, DC-Ops: Unresponsive management for gerrit2003.mgmt:22 - https://phabricator.wikimedia.org/T398544 (phaultfinder) NEW
[07:18:45] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2192.mgmt:22 - https://phabricator.wikimedia.org/T398547 (phaultfinder) NEW
[07:18:50] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2220.mgmt:22 - https://phabricator.wikimedia.org/T398545 (phaultfinder) NEW
[07:18:54] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2193.mgmt:22 - https://phabricator.wikimedia.org/T398550 (phaultfinder) NEW
[07:18:58] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2216.mgmt:22 - https://phabricator.wikimedia.org/T398548 (phaultfinder) NEW
[07:19:02] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2226.mgmt:22 - https://phabricator.wikimedia.org/T398551 (phaultfinder) NEW
[07:19:06] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2319.mgmt:22 - https://phabricator.wikimedia.org/T398549 (phaultfinder) NEW
[07:19:10] ops-codfw, DC-Ops: Unresponsive management for mc-misc2002.mgmt:22 - https://phabricator.wikimedia.org/T398552 (phaultfinder) NEW
[07:19:14] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2218.mgmt:22 - https://phabricator.wikimedia.org/T398554 (phaultfinder) NEW
[07:19:18] ops-codfw, DC-Ops: Unresponsive management for maps2008.mgmt:22 - https://phabricator.wikimedia.org/T398553
(phaultfinder) NEW
[07:19:22] ops-codfw, DC-Ops: Unresponsive management for restbase2038.mgmt:22 - https://phabricator.wikimedia.org/T398555 (phaultfinder) NEW
[07:19:28] ops-codfw, DBA, DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971019 (FCeratto-WMF)
[07:19:36] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2219.mgmt:22 - https://phabricator.wikimedia.org/T398556 (phaultfinder) NEW
[07:19:44] ops-codfw, DC-Ops: Unresponsive management for bast2003.mgmt:22 - https://phabricator.wikimedia.org/T398557 (phaultfinder) NEW
[07:19:48] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2329.mgmt:22 - https://phabricator.wikimedia.org/T398559 (phaultfinder) NEW
[07:19:50] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[07:19:52] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2223.mgmt:22 - https://phabricator.wikimedia.org/T398558 (phaultfinder) NEW
[07:19:56] ops-codfw, DC-Ops: Unresponsive management for pc2016.mgmt:22 - https://phabricator.wikimedia.org/T398560 (phaultfinder) NEW
[07:20:00] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2201.mgmt:22 - https://phabricator.wikimedia.org/T398561 (phaultfinder) NEW
[07:20:04] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2320.mgmt:22 - https://phabricator.wikimedia.org/T398562 (phaultfinder) NEW
[07:20:08] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2330.mgmt:22 - https://phabricator.wikimedia.org/T398563 (phaultfinder) NEW
[07:20:12] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2217.mgmt:22 - https://phabricator.wikimedia.org/T398564 (phaultfinder) NEW
[07:20:16] ops-codfw, DC-Ops: Unresponsive management for db2173.mgmt:22 - https://phabricator.wikimedia.org/T398565
(phaultfinder) NEW
[07:20:20] ops-codfw, DC-Ops: Unresponsive management for puppetdb2003.mgmt:22 - https://phabricator.wikimedia.org/T398567 (phaultfinder) NEW
[07:20:21] XioNoX: topranks
[07:20:24] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2225.mgmt:22 - https://phabricator.wikimedia.org/T398566 (phaultfinder) NEW
[07:20:28] ops-codfw, DC-Ops: Unresponsive management for db2182.mgmt:22 - https://phabricator.wikimedia.org/T398568 (phaultfinder) NEW
[07:20:32] ops-codfw, DC-Ops: Unresponsive management for es2044.mgmt:22 - https://phabricator.wikimedia.org/T398569 (phaultfinder) NEW
[07:20:36] ops-codfw, DC-Ops: Unresponsive management for es2040.mgmt:22 - https://phabricator.wikimedia.org/T398570 (phaultfinder) NEW
[07:20:40] ops-codfw, DC-Ops: Unresponsive management for db2213.mgmt:22 - https://phabricator.wikimedia.org/T398571 (phaultfinder) NEW
[07:20:44] ops-codfw, DC-Ops: Unresponsive management for aux-k8s-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T398572 (phaultfinder) NEW
[07:20:50] RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 16 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[07:20:55] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2227.mgmt:22 - https://phabricator.wikimedia.org/T398574 (phaultfinder) NEW
[07:20:59] ops-codfw, DC-Ops: Unresponsive management for db2181.mgmt:22 - https://phabricator.wikimedia.org/T398573 (phaultfinder) NEW
[07:21:03] (CR) Volans: [C:+1] "LGTM, one question inline" [software/debmonitor] - https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: Muehlenhoff)
[07:21:04] Amir1: OK confirmed!
looks good :)
[07:21:07] ops-codfw, DC-Ops: Unresponsive management for db2214.mgmt:22 - https://phabricator.wikimedia.org/T398575 (phaultfinder) NEW
[07:21:10] !log ladsgroup@deploy1003 musikanimal, ladsgroup: Continuing with sync
[07:21:11] ops-codfw, DC-Ops: Unresponsive management for restbase2035.mgmt:22 - https://phabricator.wikimedia.org/T398576 (phaultfinder) NEW
[07:21:18] moving forward
[07:21:19] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2222.mgmt:22 - https://phabricator.wikimedia.org/T398577 (phaultfinder) NEW
[07:21:20] thank you sooo much!
[07:21:25] ops-codfw, DC-Ops: Unresponsive management for puppetserver2004.mgmt:22 - https://phabricator.wikimedia.org/T398578 (phaultfinder) NEW
[07:21:29] ops-codfw, DC-Ops: Unresponsive management for thanos-be2005.mgmt:22 - https://phabricator.wikimedia.org/T398579 (phaultfinder) NEW
[07:21:33] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2186.mgmt:22 - https://phabricator.wikimedia.org/T398582 (phaultfinder) NEW
[07:21:37] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2221.mgmt:22 - https://phabricator.wikimedia.org/T398581 (phaultfinder) NEW
[07:21:41] ops-codfw, DC-Ops: Unresponsive management for db2219.mgmt:22 - https://phabricator.wikimedia.org/T398580 (phaultfinder) NEW
[07:21:45] ops-codfw, DC-Ops: Unresponsive management for es2048.mgmt:22 - https://phabricator.wikimedia.org/T398583 (phaultfinder) NEW
[07:21:49] (PS1) Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534)
[07:21:53] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2198.mgmt:22 - https://phabricator.wikimedia.org/T398584 (phaultfinder) NEW
[07:21:57] ops-codfw, DC-Ops: Unresponsive management for db2174.mgmt:22 - https://phabricator.wikimedia.org/T398585 (phaultfinder) NEW
[07:22:01]
10ops-codfw, 06DC-Ops: Unresponsive management for db2220.mgmt:22 - https://phabricator.wikimedia.org/T398587 (10phaultfinder) 03NEW [07:22:05] 10ops-codfw, 06DC-Ops: Unresponsive management for db2195.mgmt:22 - https://phabricator.wikimedia.org/T398586 (10phaultfinder) 03NEW [07:22:09] (03CR) 10CI reject: [V:04-1] pyrra: remove multi-dc for istio-based SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [07:22:18] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6129/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [07:22:47] yw ^_^ [07:22:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/4 (Transit: Lumen (442550278) {#1503}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:23:51] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10971251 (10Ladsgroup) >>! In T393296#10970263, @VRiley-WMF wrote: > We have received the Seed Server for this unit. Would we like to use a new/different name but set it up in the same location? Manu... 
[07:25:50] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [07:26:30] (03PS2) 10Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) [07:26:39] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166067|codeFolding: fix folding (T398430)]] (duration: 12m 16s) [07:26:42] T398430: Cursor does not update upon move when nesting unclosed tags in CM6 - https://phabricator.wikimedia.org/T398430 [07:27:03] musikanimal: fully deployed [07:27:18] you're the best! and I mean that [07:27:25] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6130/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [07:27:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Lumen (2001:1900:2100::a99) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:27:39] (03CR) 10Muehlenhoff: Unvendor jquery (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [07:28:44] <3 [07:29:05] (03CR) 10Arnaudb: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [07:29:50] RECOVERY - Juniper virtual chassis 
ports on asw2-c-eqiad is OK: OK: UP: 16 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [07:30:06] (03CR) 10Arnaudb: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [07:30:58] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 10MW-1.45-notes (1.45.0-wmf.8; 2025-07-01): Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10971315 (10elukey) a:05DLynch→03elukey Thanks, I see the metrics now in Prometheu... [07:34:46] !log upload php-excimer_1.2.5-1+wmf11u1 [07:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:41] (03CR) 10Volans: [C:03+1] "ship it" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:37:25] (03PS1) 10Giuseppe Lavagetto: Add makefile to manage the repo basic tasks [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166078 [07:37:25] (03PS1) 10Giuseppe Lavagetto: Code changes: * Search for reason in actions [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166079 [07:37:40] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add makefile to manage the repo basic tasks [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166078 (owner: 10Giuseppe Lavagetto) [07:37:50] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Code changes: * Search for reason in actions [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166079 (owner: 10Giuseppe Lavagetto) [07:38:35] (03PS1) 10Jelto: miscweb: bump all remaining miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303) [07:38:40] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: 
"Feature: search in response reasons - oblivian@cumin1003" [07:38:41] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: search in response reasons - oblivian@cumin1003 [07:39:11] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: search in response reasons - oblivian@cumin1003 [07:39:12] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Feature: search in response reasons - oblivian@cumin1003" [07:39:25] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [07:41:08] (03CR) 10Muehlenhoff: [C:03+2] Unvendor jquery [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166075 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:41:17] (03CR) 10Arnaudb: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [07:42:05] !jouncebot now [07:42:05] a Python reminder bot for deployments. 
see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [07:42:12] jouncebot: now [07:42:13] For the next 0 hour(s) and 17 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0700) [07:42:16] jouncebot: next [07:42:17] In 0 hour(s) and 17 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0800) [07:42:28] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: haproxy testing [07:44:50] (03PS1) 10Muehlenhoff: Depend on jquery and setup symlinks to the paths used by Debian [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166081 [07:44:51] (03CR) 10Jelto: [C:03+2] miscweb: bump all remaining miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [07:45:49] (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166081 (owner: 10Muehlenhoff) [07:46:26] (03PS1) 10Federico Ceratto: CAS: Add wmf group for Zarcillo, remove ops [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304) [07:46:27] (03CR) 10Federico Ceratto: "This CR reintroduces CR 1161873 as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto) [07:46:48] (03Merged) 10jenkins-bot: miscweb: bump all remaining miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166080 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [07:47:15] FIRING: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[07:49:38] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [07:50:08] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [07:51:54] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [07:52:14] RESOLVED: [2x] ProbeDown: Service people1004:30443 has failed probes (http_design_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:52:21] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [07:52:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repool pc4 T378715', diff saved to https://phabricator.wikimedia.org/P78744 and previous config saved to /var/cache/conftool/dbconfig/20250703-075225-ladsgroup.json [07:52:28] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715 [07:53:05] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [07:53:35] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [07:53:57] (03CR) 10Muehlenhoff: [C:03+2] Depend on jquery and setup symlinks to the paths used by Debian [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166081 (owner: 10Muehlenhoff) [07:55:54] (03PS1) 10Muehlenhoff: Add a lintian override for a false positive around the use of Bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166123 (https://phabricator.wikimedia.org/T397696) [07:56:44] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166123 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:57:15] (03CR) 10Muehlenhoff: [C:03+2] Add a lintian override for a false positive around the use of Bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166123 
(https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [08:00:05] jnuche and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0800). [08:01:00] morning, train is happening in 5m [08:03:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to drbd [08:03:42] PROBLEM - Host prometheus6002 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:34] RECOVERY - Host prometheus6002 is UP: PING OK - Packet loss = 0%, RTA = 87.50 ms [08:05:43] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166129 (https://phabricator.wikimedia.org/T392178) [08:05:44] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166129 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [08:06:34] (03PS1) 10Volans: Upstream release v0.6.5 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166130 [08:06:37] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166129 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [08:06:44] (03CR) 10Volans: [C:03+2] Upstream release v0.6.5 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166130 (owner: 10Volans) [08:07:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to plain [08:07:36] (03Merged) 10jenkins-bot: Upstream release v0.6.5 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166130 (owner: 10Volans) [08:08:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to plain [08:08:58] (03Abandoned) 10Majavah: P:toolforge::grid: 
disable webservicemonitor [puppet] - 10https://gerrit.wikimedia.org/r/888347 (https://phabricator.wikimedia.org/T329467) (owner: 10Majavah) [08:10:27] RESOLVED: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:13:03] !log uploaded debmonitor-server,python3-debmonitor_0.6.5 to apt.wikimedia.org bookworm-wikimedia [08:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:56] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588 (10Shisma) 03NEW [08:14:36] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.8 refs T392178 [08:14:39] T392178: 1.45.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T392178 [08:15:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:18:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6001.wikimedia.org to plain [08:20:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6001.wikimedia.org to plain [08:21:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6001.drmrs.wmnet to plain [08:22:10] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:23:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6001.drmrs.wmnet to plain [08:23:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to plain [08:25:24] (03PS1) 10Volans: kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) [08:25:32] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971395 (10Ladsgroup) This is master of s5 in codfw. @FCeratto-WMF Can you do a switchover of s5 in codfw ASAP? If this goes down, we lose the whole section in codfw. 
[08:25:46] PROBLEM - Bird Internet Routing Daemon on durum6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [08:26:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to plain [08:26:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast6003.wikimedia.org to plain [08:27:10] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:27:48] RECOVERY - Bird Internet Routing Daemon on durum6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [08:28:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast6003.wikimedia.org to plain [08:29:29] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6003.drmrs.wmnet with reason: reimage [08:32:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6003.drmrs.wmnet with OS bookworm [08:32:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10971418 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6003.drmrs.wmnet with OS bookworm [08:35:56] (03CR) 10Ladsgroup: CX: Add virtual-cx-shared DatabaseVirtualDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro) [08:36:36] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [08:37:13] FYI, ml-etcd1002 and dse-k8s-etcd1002 will briefly go down for a Ganeti node reboot [08:37:22] !log jmm@cumin1003 START - 
Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet [08:39:03] (03PS1) 10Zabe: Use correct index on categorylinks [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166133 (https://phabricator.wikimedia.org/T385890) [08:39:20] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:39:50] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:40:26] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [08:40:30] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [08:42:35] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet [08:42:43] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet [08:48:40] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet [08:50:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6003.drmrs.wmnet with reason: host reimage [08:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [08:53:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6003.drmrs.wmnet with reason: host reimage [08:53:16] (03CR) 10Volans: [C:03+1] "Looks good! I left just very minor optional things, no need to re-review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [08:54:18] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet [08:57:22] (03CR) 10Filippo Giunchedi: "LGTM overall, a few questions below:" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [08:58:40] (03PS6) 10Urbanecm: [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) [08:58:45] (03PS3) 10Urbanecm: [Growth] Remove feature flags related to Surfacing Structured Tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) [08:59:25] (03PS1) 10Vgutierrez: hiera: Switch lvs5006 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561) [09:01:39] (03PS1) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) [09:02:36] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:02:38] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6131/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:04:53] (03CR) 10Volans: "LGTM but I have a question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [09:06:43] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1210 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166138 (https://phabricator.wikimedia.org/T398593) [09:06:48] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias 
[dns] - 10https://gerrit.wikimedia.org/r/1166139 (https://phabricator.wikimedia.org/T398593) [09:08:08] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:08:24] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166140 (https://phabricator.wikimedia.org/T398594) [09:09:57] (03PS3) 10Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) [09:09:57] (03PS2) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) [09:09:57] (03PS1) 10Elukey: pyrra: fix k8s cluster name for the revertrisk SLO [puppet] - 10https://gerrit.wikimedia.org/r/1166141 [09:10:53] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6132/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:11:49] (03CR) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:13:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:13:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6003.drmrs.wmnet with OS bookworm [09:13:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10971538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6003.drmrs.wmnet with OS bookworm 
completed: - ganeti6003 (**PASS*... [09:13:22] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:14:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:14:39] (03CR) 10Elukey: [C:03+2] pyrra: fix k8s cluster name for the revertrisk SLO [puppet] - 10https://gerrit.wikimedia.org/r/1166141 (owner: 10Elukey) [09:15:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:15:31] (03CR) 10Volans: [C:03+1] "post-merge optional suggestion" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [09:16:49] (03CR) 10A smart kitten: ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798) (owner: 10Arlolra) [09:17:25] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.0 [puppet] - 10https://gerrit.wikimedia.org/r/1166142 (https://phabricator.wikimedia.org/T394382) [09:17:55] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971548 (10FCeratto-WMF) p:05Triage→03High a:03FCeratto-WMF [09:18:26] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971551 (10FCeratto-WMF) [09:19:18] (03CR) 10Volans: [C:03+1] reimage: add dhcp MAC address support for physical hosts (031 comment) 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [09:19:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:21:04] !log fceratto@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T398593 [09:21:07] T398593: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T398593 [09:21:57] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1004.eqiad.wmnet with reason: Maintenance and reboot [09:22:44] jouncebot: now [09:22:44] For the next 0 hour(s) and 37 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0800) [09:23:05] Amir1: you did a backport just now and the train is done to my understanding?
[09:23:15] (03CR) 10Volans: "minor comment inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [09:24:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T398594 [09:24:21] T398594: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T398594 [09:25:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2192 from API/vslow/dump T398594', diff saved to https://phabricator.wikimedia.org/P78745 and previous config saved to /var/cache/conftool/dbconfig/20250703-092522-fceratto.json [09:27:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:27:22] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:27:26] (03PS2) 10Majavah: natlog: Add explicit dependency to file_line [puppet] - 10https://gerrit.wikimedia.org/r/1165921 (https://phabricator.wikimedia.org/T273734) [09:27:26] (03PS1) 10Majavah: hieradata: Enable hourly logrotate on codfw cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1166145 (https://phabricator.wikimedia.org/T273734) [09:27:29] (03PS1) 10Majavah: hieradata: Enable hourly logrotate in all cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1166146 (https://phabricator.wikimedia.org/T273734) [09:27:37] I did the backport long time ago, I don't know about the train xD [09:28:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:29:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between 
cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:30:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:30:21] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7006.magru.wmnet [09:30:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7006.magru.wmnet [09:30:33] (03CR) 10Majavah: [C:03+2] natlog: Add explicit dependency to file_line [puppet] - 10https://gerrit.wikimedia.org/r/1165921 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [09:30:44] (03CR) 10Majavah: [C:03+2] hieradata: Enable hourly logrotate on codfw cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1166145 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [09:31:07] !log repooling cp7006 [09:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:24] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166140 (https://phabricator.wikimedia.org/T398594) (owner: 10Gerrit maintenance bot) [09:32:30] taavi: you might see a pending puppet change from me that needs merging [09:32:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598 (10cmooney) 03NEW p:05Triage→03High [09:32:50] ok I was able to merge it after you [09:32:53] federico3: that didn't come up for me, up to you to merge [09:33:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:34:25] FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10971628 (10cmooney) [09:34:45] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2186.mgmt:22 - https://phabricator.wikimedia.org/T398582#10971629 (10cmooney) [09:34:46] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2221.mgmt:22 - https://phabricator.wikimedia.org/T398581#10971630 (10cmooney) [09:34:48] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2219.mgmt:22 - https://phabricator.wikimedia.org/T398580#10971631 (10cmooney) [09:34:50] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for thanos-be2005.mgmt:22 - https://phabricator.wikimedia.org/T398579#10971632 (10cmooney) [09:34:52] !log Starting s5 codfw failover from db2213 to db2192 - T398594 [09:34:54] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2181.mgmt:22 - https://phabricator.wikimedia.org/T398573#10971633 (10cmooney) [09:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:55] T398594: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T398594 [09:34:58] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aux-k8s-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T398572#10971634 (10cmooney) [09:35:02] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2213.mgmt:22 - 
https://phabricator.wikimedia.org/T398571#10971635 (10cmooney) [09:35:06] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2040.mgmt:22 - https://phabricator.wikimedia.org/T398570#10971636 (10cmooney) [09:35:10] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2044.mgmt:22 - https://phabricator.wikimedia.org/T398569#10971637 (10cmooney) [09:35:14] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2182.mgmt:22 - https://phabricator.wikimedia.org/T398568#10971638 (10cmooney) [09:35:18] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for puppetdb2003.mgmt:22 - https://phabricator.wikimedia.org/T398567#10971639 (10cmooney) [09:35:22] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2173.mgmt:22 - https://phabricator.wikimedia.org/T398565#10971640 (10cmooney) [09:35:26] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2217.mgmt:22 - https://phabricator.wikimedia.org/T398564#10971641 (10cmooney) [09:35:30] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2320.mgmt:22 - https://phabricator.wikimedia.org/T398562#10971643 (10cmooney) [09:35:34] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2330.mgmt:22 - https://phabricator.wikimedia.org/T398563#10971642 (10cmooney) [09:35:38] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for pc2016.mgmt:22 - https://phabricator.wikimedia.org/T398560#10971645 (10cmooney) [09:35:42] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2201.mgmt:22 - https://phabricator.wikimedia.org/T398561#10971644 (10cmooney) [09:35:46] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2329.mgmt:22 - https://phabricator.wikimedia.org/T398559#10971646 (10cmooney) [09:35:50] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for bast2003.mgmt:22 - https://phabricator.wikimedia.org/T398557#10971647 (10cmooney) [09:35:54] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for 
wikikube-worker2219.mgmt:22 - https://phabricator.wikimedia.org/T398556#10971649 (10cmooney) [09:36:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2192 to s5 primary T398594', diff saved to https://phabricator.wikimedia.org/P78746 and previous config saved to /var/cache/conftool/dbconfig/20250703-093612-fceratto.json [09:38:05] (03PS1) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) [09:39:01] (03CR) 10Volans: "LGTM, very minor suggestions inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [09:39:03] (03PS2) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) [09:39:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2213 weights T398594', diff saved to https://phabricator.wikimedia.org/P78747 and previous config saved to /var/cache/conftool/dbconfig/20250703-093943-fceratto.json [09:40:01] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6134/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:40:18] (03CR) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:40:26] federico3: don't switchover eqiad please, that requires read only down time and it's not needed [09:40:31] (03CR) 10Volans: reimage: add support for using the host UUID for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [09:40:56] Amir1: yes, I'm not touching eq, only cod [09:41:22] should I close T398593 then? 
[09:41:22] T398593: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T398593 [09:42:54] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971674 (10FCeratto-WMF) db2213 has been flipped, now it's a candidate master [09:45:15] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10971693 (10Ladsgroup) Thanks! I'd wait for dc ops. it's a bit high prio since it's candidate master but less prio because it's not a master now [09:48:08] 10ops-codfw, 06DBA, 06DC-Ops: PSU issue on es2044 - https://phabricator.wikimedia.org/T398601 (10FCeratto-WMF) 03NEW [09:50:25] (03CR) 10Volans: "Great to see some additional tests, thanks! Few typos an a simplification suggestion inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi) [09:51:14] (03Abandoned) 10Ladsgroup: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1166139 (https://phabricator.wikimedia.org/T398593) (owner: 10Gerrit maintenance bot) [09:51:34] (03Abandoned) 10Ladsgroup: mariadb: Promote db1210 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1166138 (https://phabricator.wikimedia.org/T398593) (owner: 10Gerrit maintenance bot) [09:59:25] RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:04] effie: How many deployers does it take to do MediaWiki infrastructure (UTC mid-day) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1000). 
[10:00:21] !log upgrading production debmonitor-server to the latest v0.6.5 [10:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:33] 1 is the loneliest number [10:01:43] (03CR) 10Volans: [C:03+2] debmonitor: add link to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:01:55] (03CR) 10Volans: [C:03+2] debmonitor: use the new endpoint for the check [puppet] - 10https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:03:02] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Debmonitor [10:03:22] downtime didn't arrive in time, sorry [10:03:32] FIRING: ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:04:35] FIRING: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:42] !log volans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on debmonitor2003.codfw.wmnet,debmonitor1003.eqiad.wmnet,debmonitor-dev2001.codfw.wmnet with reason: deploy new version [10:07:19] (03PS1) 10Effie Mouzeli: php8.1: rebuild images to pick up excimer 1.2.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165594 (https://phabricator.wikimedia.org/T397907) 
(owner: 10Scott French) [10:07:27] (03CR) 10Effie Mouzeli: [C:03+2] php8.1: rebuild images to pick up excimer 1.2.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165594 (https://phabricator.wikimedia.org/T397907) (owner: 10Scott French) [10:07:51] (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] php8.1: rebuild images to pick up excimer 1.2.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165594 (https://phabricator.wikimedia.org/T397907) (owner: 10Scott French) [10:08:24] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver2001.codfw.wmnet [10:08:29] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:12:11] (03PS1) 10Samtar: InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) [10:12:13] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2001.codfw.wmnet [10:16:42] 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10971808 (10Volans) Totally agree there is no point. For the `idrac` one the only potential use case would be to match it with our existing `mgmt` b... [10:17:01] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10971813 (10Volans) Option 2 LGTM too [10:20:26] 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10971830 (10cmooney) >>! In T398464#10971808, @Volans wrote: > Totally agree there is no point. For the `idrac` one the only potential use case woul... 
[10:20:50] (03PS1) 10Clément Goubert: team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156 [10:21:51] (03CR) 10Kosta Harlan: [C:03+1] team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156 (owner: 10Clément Goubert) [10:22:36] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10971847 (10cmooney) [10:23:36] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1005.eqiad.wmnet with reason: Maintenance and reboot [10:23:42] 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10971849 (10Volans) Ack, let's do both: disable it in the bios and skip it in the import [10:24:02] (03CR) 10Clément Goubert: [C:03+2] team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156 (owner: 10Clément Goubert) [10:25:11] (03Merged) 10jenkins-bot: team-sre/mw-cron: Fix description command [alerts] - 10https://gerrit.wikimedia.org/r/1166156 (owner: 10Clément Goubert) [10:26:14] (03PS1) 10Volans: debmonitor: use the new endpoint for checks [puppet] - 10https://gerrit.wikimedia.org/r/1166158 (https://phabricator.wikimedia.org/T397696) [10:26:48] !log jiji@deploy1003 Started scap sync-world: T397907 - Upgrade Excimer to 1.2.5 in production [10:26:52] T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907 [10:27:13] 06SRE, 10decommission-hardware, 06Infrastructure-Foundations: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607 (10MoritzMuehlenhoff) 03NEW [10:29:37] (03PS1) 10Muehlenhoff: Remove puppetserver2003 from serving requests [dns] - 10https://gerrit.wikimedia.org/r/1166159 (https://phabricator.wikimedia.org/T398607) [10:33:03] RECOVERY - 
debmonitor.discovery.wmnet:443 internal on debmonitor2003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 578 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [10:37:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1166158 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:37:16] (03CR) 10Volans: [C:03+2] debmonitor: use the new endpoint for checks [puppet] - 10https://gerrit.wikimedia.org/r/1166158 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:42:17] !log jiji@deploy1003 Stopping before sync operations [10:42:29] 06SRE, 07SRE-Unowned, 10Deployments, 06Release-Engineering-Team: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10971949 (10hnowlan) In recent weeks I've noticed more and more tendency to say "let's move to -sre" when an incident begins or lots of coordin... [10:43:33] !log jiji@deploy1003 Locking from deployment [ALL REPOSITORIES]: T397907 - Upgrade Excimer to 1.2.5 in production in progress, blocking deploys [10:43:36] T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907 [10:44:01] (03PS1) 10Muehlenhoff: Remove puppetserver2003 from active Puppet servers [puppet] - 10https://gerrit.wikimedia.org/r/1166160 (https://phabricator.wikimedia.org/T398607) [10:44:03] (03PS1) 10Muehlenhoff: Remove puppetserver role frm puppetserver2003 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) [10:44:23] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [10:44:41] (03Abandoned) 10Hnowlan: admin_ng: increase limits for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160767 (owner: 10Hnowlan) [10:45:04] (03Abandoned) 10Hnowlan: Revert "changeprop: Remove rules related to parsoid (RB sunset)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159535 
(https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:47:38] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:48:55] (03CR) 10Samwilson: [C:03+1] InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar) [10:49:51] !log starting staged rollout of Excimer to 1.2.5 mw-debug first, mw-api-int next [10:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:26] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [10:51:59] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:54:19] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:54:35] jouncebot: nowandnext [10:54:36] For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1000) [10:54:36] In 1 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1200) [10:55:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar) [10:57:05] TheresNoTime: I am running a deployment that will take quite a long time [10:57:38] effie: no worries, I've scheduled what I was going to do for an actual backport window :) [10:57:46] cool! 
[10:59:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612 (10cmooney) 03NEW p:05Triage→03High [11:01:09] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [11:03:32] RESOLVED: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:40] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:04:12] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:04:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [11:05:23] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [11:05:30] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:05:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10972002 (10Clement_Goubert) 05In progress→03Resolved a:03Clement_... 
[11:06:25] FIRING: SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:30] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:07:43] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:10:54] cmooney@cumin1003 netbox (PID 101345) is awaiting input [11:11:53] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for rgw.codfw.dpe.anycast.wmnet - cmooney@cumin1003" [11:11:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for rgw.codfw.dpe.anycast.wmnet - cmooney@cumin1003" [11:11:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:12:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet [11:13:43] (03CR) 10Effie Mouzeli: [C:03+1] service: add discovery active/active config [puppet] - 10https://gerrit.wikimedia.org/r/1164458 (https://phabricator.wikimedia.org/T397618) (owner: 10Hnowlan) [11:15:08] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:15:48] !log starting staged rollout of Excimer to 1.2.5, mw-api-ext [11:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:54] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:16:20] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:17:35] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1004.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002 [11:18:08] (03PS1) 
10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [11:18:34] (03CR) 10CI reject: [V:04-1] cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [11:20:08] (03PS2) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [11:21:04] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:21:10] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:21:54] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:24:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10972037 (10Stevemunene) 05Open→03Resolved >>! In T390176#10967599, @Jclark-ctr wrote: > @... 
[11:24:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10972040 (10Stevemunene) [11:25:03] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [11:26:36] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:26:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613 (10Stevemunene) 03NEW [11:27:49] !log jiji@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: T397907 - Upgrade Excimer to 1.2.5 in production in progress, blocking deploys (duration: 44m 16s) [11:27:52] T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907 [11:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [11:29:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6003.drmrs.wmnet to cluster drmrs01 and group B12 [11:30:00] !log jiji@deploy1003 Started scap sync-world: T397907 - Upgrade Excimer to 1.2.5 in production [11:31:25] RESOLVED: SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:32:46] jmm@cumin2002 addnode (PID 260760) is awaiting input [11:33:49] (03PS3) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 
10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [11:35:01] (03PS1) 10Muehlenhoff: Remove jquery [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168 [11:35:06] !log jiji@deploy1003 Finished scap sync-world: T397907 - Upgrade Excimer to 1.2.5 in production (duration: 06m 59s) [11:35:09] T397907: Upgrade Excimer to 1.2.5 in production - https://phabricator.wikimedia.org/T397907 [11:36:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10972104 (10MoritzMuehlenhoff) [11:36:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [11:37:12] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1005.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002 [11:37:22] (03CR) 10Muehlenhoff: kubernetes: show also the image OS (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [11:37:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6003.drmrs.wmnet to cluster drmrs01 and group B12 [11:38:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install6002.wikimedia.org to drbd [11:38:48] (03PS2) 10Muehlenhoff: Remove jquery [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168 [11:40:31] (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168 (owner: 10Muehlenhoff) [11:41:02] (03PS2) 10Volans: kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) [11:41:05] (03CR) 10Volans: kubernetes: show also the image OS (031 comment) [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [11:41:50] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [11:41:58] (03CR) 10Muehlenhoff: [C:03+2] Remove jquery [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166168 (owner: 10Muehlenhoff) [11:45:09] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [11:45:52] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [11:48:42] FIRING: JobUnavailable: Reduced availability for job squid in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:35] (03PS3) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) [11:53:48] (03CR) 10Arnaudb: [C:03+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.0 [puppet] - 10https://gerrit.wikimedia.org/r/1166142 (https://phabricator.wikimedia.org/T394382) (owner: 10Jelto) [11:55:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:56:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:56:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install6002.wikimedia.org to drbd [11:56:33] PROBLEM - Host install6002 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:39] RECOVERY - Host install6002 is UP: PING OK - Packet loss = 0%, RTA = 87.38 ms [11:57:40] (03PS1) 10Stevemunene: hdfs: set an-worker1176 and 1179 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1166170 (https://phabricator.wikimedia.org/T398027) [11:58:42] RESOLVED: JobUnavailable: Reduced availability for job squid in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1200) [12:11:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137) (owner: 10EggRoll97) [12:13:38] (03PS1) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) [12:13:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:13:53] (03PS2) 10EggRoll97: Allow abusefilter block action on plwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137) [12:13:55] (03PS2) 10Cathal Mooney: PuppetDB import: do not import vxlan, 
openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) [12:14:43] (03PS1) 10Kosta Harlan: special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) [12:15:14] jnuche: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1166178 should fix the logspam related to T398167 [12:15:14] T398167: MediaWiki\Exception\UserNotLoggedIn: Please log in to be able to access this page or action. - https://phabricator.wikimedia.org/T398167 [12:15:25] are you able to deploy it? [12:15:46] (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [12:16:28] kostajh: yeah, but I'd feel more comfortable if someone else can take a look at it before we deploy it [12:18:05] jnuche: I can have a look once it's staged on mwdebug [12:18:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:19:06] (03PS3) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) [12:19:25] kostajh: that sounds good, also I see that's already the cherry pick [12:19:34] I'm ok with deploying [12:19:41] let's do it [12:20:15] cool [12:20:26] (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [12:20:57] 
(03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: 10Kosta Harlan) [12:22:30] (03CR) 10Volans: [C:03+2] kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [12:23:33] (03Merged) 10jenkins-bot: kubernetes: show also the image OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166131 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [12:23:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:27:28] (03CR) 10CI reject: [V:04-1] special: Do not throw ErrorPageError from getRedirect() [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: 10Kosta Harlan) [12:27:36] (03PS1) 10Volans: Revert "Remove jquery" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166185 [12:28:38] kostajh: patch got a test failure [12:28:43] (03PS1) 10Mvolz: Update parameter name for wikidata/citoid integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576) [12:29:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:29:31] (03CR) 10CI reject: [V:04-1] Update parameter name for wikidata/citoid integration 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [12:29:33] (03PS2) 10Mvolz: Update parameter name for wikidata/citoid integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576) [12:29:41] jnuche: hmm [12:29:44] (03PS1) 10Volans: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) [12:30:00] (03CR) 10Volans: [C:03+2] Revert "Remove jquery" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166185 (owner: 10Volans) [12:30:22] (03CR) 10CI reject: [V:04-1] Update parameter name for wikidata/citoid integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166186 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [12:30:23] jnuche: that is unrelated, seems like a flaky test [12:30:59] (03Merged) 10jenkins-bot: Revert "Remove jquery" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166185 (owner: 10Volans) [12:32:02] (03CR) 10Jaime Nuche: "recheck" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166178 (https://phabricator.wikimedia.org/T398167) (owner: 10Kosta Harlan) [12:32:29] (03PS1) 10Volans: JS links: fix jquery version for bookworm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696) [12:32:47] kostajh: ack, retrying [12:33:40] (03PS2) 10Volans: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) [12:34:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:45] (03PS1) 10Effie Mouzeli: hieradata: migrate 
memcached gutter pool to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1166194 (https://phabricator.wikimedia.org/T398611) [12:36:59] (03PS3) 10Volans: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) [12:38:37] 10SRE-SLO, 13Patch-For-Review: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#10972370 (10elukey) a:03elukey [12:38:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:38:50] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#10972371 (10elukey) a:03herron [12:39:25] FIRING: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:13] kostajh: failed again, seems like the same failure [12:46:34] jnuche: OK. Maybe ping Growth team as maintainers of CommunityConfiguration? Is there a task for the test failure already? 
[12:46:44] cc urbanecm ^ [12:47:58] (03CR) 10Muehlenhoff: [C:03+1] JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [12:48:10] (03PS4) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) [12:48:35] kostajh: I didn't find anything with a quick Phab search [12:48:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:49:39] (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [12:50:25] kostajh: I need to leave soon for a doctor's appointment, sorry about that. 
andre should be able to help with the backport once the problem with the tests is solved [12:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [12:51:07] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@09893e3]: bump section topics to v1.7.0 [12:53:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:54:12] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@09893e3]: bump section topics to v1.7.0 (duration: 03m 20s) [12:54:25] FIRING: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:14] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye [12:59:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6001.drmrs.wmnet to drbd [13:00:04] Urbanecm and TheresNoTime: I 
seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1300). [13:00:04] TheresNoTime and EggRoll97: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:18] o/ [13:01:28] (03PS5) 10Cathal Mooney: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) [13:03:10] (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [13:03:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar) [13:03:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137) (owner: 10EggRoll97) [13:04:25] RESOLVED: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:27] (03Merged) 10jenkins-bot: InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166155 (https://phabricator.wikimedia.org/T377978) (owner: 10Samtar) [13:04:29] (03Merged) 10jenkins-bot: Allow abusefilter block action on plwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165635 
(https://phabricator.wikimedia.org/T398137) (owner: 10EggRoll97) [13:04:46] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1166155|InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default (T377978)]], [[gerrit:1165635|Allow abusefilter block action on plwikiquote (T398137)]] [13:04:51] T377978: [STORY] Template favouriting available on all foundation wikis - https://phabricator.wikimedia.org/T377978 [13:04:52] T398137: Allow blocking by abuse filter on Polish Wikiquote - https://phabricator.wikimedia.org/T398137 [13:05:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10972482 (10LSobanski) p:05Low→03Medium [13:06:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:07:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1166194 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [13:08:42] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:08:43] !log samtar@deploy1003 samtar, eggroll97: Backport for [[gerrit:1166155|InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default (T377978)]], [[gerrit:1165635|Allow abusefilter block action on plwikiquote (T398137)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:08:48] EggRoll97: ready to test on mwdebug [13:08:54] * TheresNoTime is also testing their patch [13:10:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6001.drmrs.wmnet to drbd [13:10:12] PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100% [13:10:28] RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 87.47 ms [13:10:42] TheresNoTime: lgtm [13:11:21] !log samtar@deploy1003 samtar, eggroll97: Continuing with sync [13:13:08] (03PS8) 10Arnaudb: gerrit: sanity checks as a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) [13:13:08] (03CR) 10Arnaudb: "This cookbook is testable on any cumin host with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:13:23] (03CR) 10Ssingh: "1. confd/confctl. Essentially, this is the template:" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:13:42] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:15:35] (03PS3) 10Arnaudb: gerrit: sanity checks cookbook implementation [cookbooks] - 10https://gerrit.wikimedia.org/r/1165880 (https://phabricator.wikimedia.org/T387833) [13:15:35] (03CR) 10Arnaudb: "Implementation of the topology-check cookbook, some readability improvements" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165880 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:16:20] (03CR) 10Volans: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [13:17:59] !log jmm@cumin2002 START - 
Cookbook sre.ganeti.changedisk for changing disk type of doh6001.wikimedia.org to drbd [13:18:32] !log sudo cumin 'C:bird' "disable-puppet 'merging CR 1163858'": T374619 [13:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:35] T374619: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619 [13:18:42] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:51] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166155|InitialiseSettings: Enable wgTemplateDataEnableDiscovery as default (T377978)]], [[gerrit:1165635|Allow abusefilter block action on plwikiquote (T398137)]] (duration: 14m 04s) [13:18:54] T377978: [STORY] Template favouriting available on all foundation wikis - https://phabricator.wikimedia.org/T377978 [13:18:55] T398137: Allow blocking by abuse filter on Polish Wikiquote - https://phabricator.wikimedia.org/T398137 [13:18:56] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [13:19:03] EggRoll97: done :) [13:19:11] TheresNoTime: another backport done! tysm [13:20:01] !log done UTC afternoon backport window [13:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972537 (10Jclark-ctr) @Stevemunene This error was showing on console. looks like VD for os was cleared out at some... 
[13:21:12] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:21:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972540 (10Jclark-ctr) a:05Stevemunene→03Jclark-ctr [13:21:39] !log sudo cumin -b11 'C:bird' "run-puppet-agent --enable 'merging CR 1163858'": NOOP change T374619 [13:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972541 (10Jclark-ctr) Also updating bios and idrac firmware [13:22:02] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [13:22:08] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [13:22:09] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto) [13:22:29] (03Abandoned) 10Stevemunene: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [13:22:51] (03CR) 10Federico Ceratto: [C:03+2] CAS: Add wmf group for Zarcillo, remove ops [puppet] - 10https://gerrit.wikimedia.org/r/1166082 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto) [13:23:42] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:30] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [13:24:41] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:26:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6001.wikimedia.org to drbd [13:26:42] PROBLEM - Host doh6001 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:14] RECOVERY - Host doh6001 is UP: PING OK - Packet loss = 0%, RTA = 87.64 ms [13:27:50] (03PS1) 10Muehlenhoff: Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200 [13:28:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6001.drmrs.wmnet to drbd [13:28:39] (03PS2) 10Muehlenhoff: Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200 [13:28:42] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:28:55] (03PS6) 10Volans: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [13:29:14] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:01] (03CR) 10Ssingh: [V:03+1 C:03+2] P:bird and C:bird::anycast: support exporting Prom metrics [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:30:19] 06SRE, 06DBA, 06serviceops, 05MW-1.44-notes, and 2 others: HTTP 
503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10972588 (10Clement_Goubert) [13:30:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:25] (03CR) 10CI reject: [V:04-1] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [13:31:17] (03PS7) 10Volans: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [13:31:47] (03PS3) 10Ssingh: hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) [13:32:46] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6137/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:33:14] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:33:54] PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100% [13:34:38] jouncebot: nowandnext [13:34:38] For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1300) [13:34:38] In 0 hour(s) and 55 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1430) [13:34:57] I'm gonna reboot the docker-registry nodes 
[13:35:35] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet [13:35:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972622 (10Jclark-ctr) @cmooney i am available to assist [13:36:40] RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [13:38:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6001.drmrs.wmnet to drbd [13:39:13] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye [13:39:50] PROBLEM - Bird Internet Routing Daemon on durum6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:40:05] ^ this is expected and unrelated to the other bird change being rolled out, which is a NOOP [13:40:18] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet [13:40:34] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2005.codfw.wmnet [13:40:53] !log installing libxml2 security updates on bookworm [13:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:25] FIRING: SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:48] RECOVERY - Bird Internet Routing Daemon on durum6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:42:12] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:42:14] registry2004 is expected? [13:42:19] vgutierrez: yeah, reboot [13:42:22] ack [13:42:42] yeah no something's wrong with the restart [13:42:50] Jul 03 13:42:16 registry2004 docker-registry[2789]: configuration error: open /etc/docker/registry/config.yml: no such file or directory [13:43:17] It's apparently not taken into account by the restart/repool cookbook's check before repool [13:43:57] FIRING: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:14] expected, on it [13:44:15] <_joe_> yeah the docker registry is down [13:44:17] !incidents [13:44:18] 6453 (UNACKED) ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw) [13:44:18] 6452 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin) [13:44:18] 6451 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin) [13:44:18] 6450 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [13:44:19] 6448 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [13:44:19] yeah i know [13:44:19] 6449 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:44:19] 6447 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [13:44:19] 6446 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:44:22] !ack 6453 [13:44:23] 6453 (ACKED) ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw) [13:44:40] slyngs: ^^ could you take that one? 
I'm on a meeting with Kwaku at the moment [13:44:48] Sure [13:44:58] (03CR) 10Ssingh: [V:03+1] "Changed the patch to roll out only one DNS host to all. Since it's just exporting Prom metrics, I think it's fine." [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:45:00] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2005.codfw.wmnet [13:45:10] (03CR) 10Ssingh: [V:03+1] "I meant *DOH* host not *DNS* :)" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:46:22] !log sudo cumin 'A:wikidough' "disable-puppet 'merging CR 1163859'" [13:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:25] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:34] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [13:46:43] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [13:46:46] <_joe_> the docker-registry service is probably a leftover [13:46:57] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:46:59] <_joe_> given we have multiple registry instances running [13:47:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet [13:47:34] yeah there are two config files [13:47:48] what's weird is why is it down [13:47:54] there are two instances running [13:48:08] It's the CODFW one [13:48:19] I know [13:48:21] <_joe_> claime: yeah I 
have no idea how it works, but on 1004 it's the same situation [13:48:23] Okay :-) [13:48:44] There are two registry instances running on the registry servers, one with apus backend and one with swift [13:48:54] and the nginx config is supposed to serve them [13:48:55] <_joe_> yes [13:48:57] RESOLVED: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:02] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [13:49:15] so registry2005 rebooted correctly [13:49:16] <_joe_> when restarting what I think happened is that the old service tried to take over a port or something [13:49:30] ● docker-registry.service loaded failed failed the Docker toolset to pack, ship, store, and deliver content [13:50:17] <_joe_> claime: yes that service needs to be manually eradicated I guess [13:50:24] yeah [13:50:26] <_joe_> it was removed but not absented [13:50:46] <_joe_> in any case; the registry is back up? [13:51:16] yeah, but I don't understand why it came back up correctly on 2005 and not on 2004 [13:51:16] <_joe_> yes it is [13:51:28] <_joe_> it is serving traffic from 2004? [13:51:44] <_joe_> it is [13:51:53] <_joe_> so it came back correctly there too, eventually [13:52:01] <_joe_> did you start anything manually? [13:52:06] no [13:52:21] <_joe_> so it is actually working [13:52:31] (03PS1) 10Ssingh: Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166203 [13:52:34] <_joe_> maybe takes some time to startup? [13:52:50] <_joe_> I'm not even sure it properly failed [13:53:21] it needed puppet to run from what I can see from the puppet logs [13:53:41] <_joe_> what did the puppet run do? start the services? 
[13:53:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:53:58] _joe_: yeah [13:54:04] 2025-07-03T13:39:18.050082+00:00 registry2004 puppet-agent[1248]: (/Stage[main]/Nginx/Service[nginx]/ensure) ensure changed 'stopped' to 'running' (corrective) [13:54:12] 2025-07-03T13:39:19.058234+00:00 registry2004 puppet-agent[1248]: (/Stage[main]/Docker_registry::Web/Jwt_authorizer::Service[docker-registry-ha-jwt]/Systemd::Service[docker-registry-ha-jwt]/Service[docker-registry-ha-jwt]/ensure) ensure changed 'stopped' to 'running' (corrective) [13:54:27] <_joe_> wait wat [13:54:44] <_joe_> so nginx doesn't start on reboot? lol [13:54:51] :-) [13:54:53] I'mma look at the puppet code [13:55:10] <_joe_> I would look at the logs from the servers and the setup of systemd there [13:55:23] (03CR) 10Ssingh: [C:03+2] Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166203 (owner: 10Ssingh) [13:55:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:52] We can also change the ensure => stopped on the registry to absent [13:56:09] <_joe_> that's probably a good idea for that service [13:56:10] yes, that's what I'm doing [13:56:22] plus checking what the nginx systemd definition is doing [13:56:23] <_joe_> but that's not why nginx failed to start, I'd assume [13:56:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972687 (10cmooney) 
@Jclark-ctr has replaced the optics both side of the link. Link is up and light levels healthy, we'll see how it goe... [13:57:27] * vgutierrez back [13:57:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972688 (10Jclark-ctr) Replaced both optics no spares on site now at eqiad [13:57:36] No, it failed to start and then Puppet started it. I think it's permissions: bind() to unix:/var/run/nginx-auth/basic.sock failed [13:58:06] yeah so [13:58:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10972689 (10Jclark-ctr) sr4 optics black handle @RobH [13:58:16] (03PS1) 10Ssingh: hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619) [13:58:22] <_joe_> it needs the ha-service up before nginx [13:58:24] the nginx service cleans up the /var/run/nginx-auth/ dir as ExecStopPost [13:58:28] <_joe_> so we need to declare the dependency [13:58:28] This is created by puppet [13:58:39] So it's not present on boot [13:58:52] so when it tries to start it just can't until puppet runs [13:58:53] <_joe_> that dir is created by puppet? 
[13:58:55] yeah [13:59:13] <_joe_> yeah it needs to be in tmpfiles I guess [13:59:16] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6138/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:59:24] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet [13:59:27] modules/docker_registry/manifests/web.pp L144 and following [14:00:30] <_joe_> yeah so, either we create the dir before nginx starts in ExecStartPre [14:00:40] <_joe_> which is probably a good idea [14:00:47] what's the purpose of deleting it after service stops? [14:00:56] https://trac.nginx.org/nginx/ticket/753 [14:01:03] <_joe_> or we do that before the auth service starts [14:01:04] we're removing the socket only [14:01:10] not the dir, my bad [14:01:48] it doesn't matter [14:01:50] it's on /var/run [14:01:57] <_joe_> yep [14:01:58] so that's a tmpfs that will be "empty" on system reboot [14:02:02] yeah [14:02:13] as _joe_ mentioned we need to move that to a tmpfiles.d config [14:02:16] I'm for adding it to ExecStartPre [14:02:16] <_joe_> and the dir needs to be created [14:02:28] <_joe_> either/or [14:02:30] and/or ExecStartPre [14:02:38] as long as it doesn't fail if it's already there :D [14:02:41] <_joe_> ExecStartPre is easier to understand [14:02:43] mkdir -p should work [14:02:49] We can keep the puppet definition for rights and stuff [14:02:55] Add it as ExecStartPre and remove as a dir Puppet creates [14:02:57] (03CR) 10Ssingh: [C:03+2] prometheus: add dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:03:10] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet [14:03:26] <_joe_> claime: that puppet declaration 
makes no sense; it should be in tmpfiles.d [14:03:36] <_joe_> and I think we even have a puppet define for it [14:03:46] yep, we have tmpfiles.d in puppet [14:04:05] <_joe_> systemd::tmpfile [14:04:06] systemd::tmpfile [14:04:17] we use it a bunch of places [14:04:17] jinx [14:04:19] happy to patch it if needed [14:04:34] vgutierrez: it's ok I'm on it [14:04:40] cool [14:05:08] archiva does this already, can be copied from there [14:05:37] !log restarting clamav to pick up libxml security updates [14:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:28] (03CR) 10Ssingh: [C:03+2] prometheus: add dnsbox_service_state_exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:08:30] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1006.eqiad.wmnet with reason: Maintenance and reboot [14:09:23] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1007.eqiad.wmnet with reason: Maintenance and reboot [14:09:39] (03PS1) 10Muehlenhoff: crm: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/1166205 (https://phabricator.wikimedia.org/T135991) [14:11:33] (03PS1) 10C. Scott Ananian: skin: Omit "rendered with" phrase when the message is disabled [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) [14:12:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: 10C. 
Scott Ananian) [14:12:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:58] Also I got tricked again, I started with codfw because I thought it was secondary [14:14:07] forgot they were inverse pooled.., [14:14:20] (03PS1) 10Muehlenhoff: k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209 [14:14:33] (03CR) 10Elukey: [C:03+1] Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200 (owner: 10Muehlenhoff) [14:14:50] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:38] (03CR) 10Muehlenhoff: [C:03+2] Add site-specific Cumin alias for aux cluster [puppet] - 10https://gerrit.wikimedia.org/r/1166200 (owner: 10Muehlenhoff) [14:15:40] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:43] 06SRE, 06collaboration-services, 06Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10972817 (10ABran-WMF) [14:17:18] !log depooling cp7006 for testing [14:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:34] (03PS29) 10Elukey: Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) [14:20:28] (03CR) 10Vgutierrez: [C:03+1] hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:21:08] (03PS1) 10Ssingh: P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619) [14:21:18] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: enable exporting anycast-hc prom metrics for O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166204 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:22:16] (03CR) 10Elukey: [C:03+1] Remove puppetserver2003 from active Puppet servers [puppet] - 10https://gerrit.wikimedia.org/r/1166160 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [14:23:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [14:23:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10972864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for 
host an-worke... [14:24:02] (03CR) 10Elukey: "Left aa nit but it looks good to me. Maybe for extra security, let's run PCC to confirm?" [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [14:24:05] (03CR) 10CI reject: [V:04-1] skin: Omit "rendered with" phrase when the message is disabled [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: 10C. Scott Ananian) [14:25:25] (03PS4) 10Clément Goubert: docker_registry: Move nginx auth socket to tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1166208 [14:28:08] (03PS2) 10Muehlenhoff: k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209 [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1430) [14:30:44] (03CR) 10Volans: "Nice! I finally found the time for a full pass, sorry for the delay. 
Consider pretty much all comments as optional except the reported bug" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [14:30:47] (03CR) 10Muehlenhoff: "Sure: PCC at https://puppet-compiler.wmflabs.org/output/1166161/6135/" [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [14:31:20] (03CR) 10Volans: [C:03+2] JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:31:34] (03CR) 10Volans: [C:03+2] JS links: fix jquery version for bookworm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:31:37] (03PS2) 10Muehlenhoff: Remove puppetserver role frm puppetserver2003 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) [14:32:13] (03Merged) 10jenkins-bot: JS: remove jquery vendored files [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166188 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:32:18] (03CR) 10Clément Goubert: [C:03+2] docker_registry: Move nginx auth socket to tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1166208 (owner: 10Clément Goubert) [14:32:26] (03Merged) 10jenkins-bot: JS links: fix jquery version for bookworm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166192 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:32:36] !log installing bootstrap4 security updates [14:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:27] (03CR) 10Volans: [C:03+1] "Haven't tested it but looks sane to me." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [14:33:31] (03PS3) 10Muehlenhoff: k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209 [14:34:03] (03CR) 10Elukey: [C:03+1] k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209 (owner: 10Muehlenhoff) [14:37:24] 06SRE, 06collaboration-services, 06Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10972892 (10ssingh) Happy to collaborate on this, FWIW. [14:38:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Maintenance [14:38:52] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10972894 (10MoritzMuehlenhoff) [14:38:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T395241)', diff saved to https://phabricator.wikimedia.org/P78751 and previous config saved to /var/cache/conftool/dbconfig/20250703-143854-fceratto.json [14:39:54] (03PS2) 10Clément Goubert: docker_registry: Disable service [puppet] - 10https://gerrit.wikimedia.org/r/1166213 [14:40:15] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] k8s: Add missing definitions for aux-codfw [cookbooks] - 10https://gerrit.wikimedia.org/r/1166209 (owner: 10Muehlenhoff) [14:40:49] (03CR) 10Clément Goubert: [C:03+2] docker_registry: Disable service [puppet] - 10https://gerrit.wikimedia.org/r/1166213 (owner: 10Clément Goubert) [14:40:56] (03CR) 10Volans: Add support for kubernetes (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [14:42:56] (03CR) 10JHathaway: [C:03+1] Remove puppetserver2003 from serving requests [dns] - 10https://gerrit.wikimedia.org/r/1166159 
(https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [14:43:17] !log jmm@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker-eqiad [14:43:22] (03CR) 10JHathaway: [C:03+1] Remove puppetserver2003 from active Puppet servers [puppet] - 10https://gerrit.wikimedia.org/r/1166160 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [14:44:05] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver2003 from serving requests [dns] - 10https://gerrit.wikimedia.org/r/1166159 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [14:44:10] !log jmm@dns1004 START - running authdns-update [14:44:33] (03PS1) 10Ssingh: Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166215 [14:45:08] (03CR) 10Ssingh: "Will resolve below and try again:" [puppet] - 10https://gerrit.wikimedia.org/r/1166215 (owner: 10Ssingh) [14:45:14] !log jmm@dns1004 END - running authdns-update [14:45:22] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet [14:45:44] (03CR) 10Ssingh: [C:03+2] Revert "hiera: enable exporting anycast-hc prom metrics for O:wikidough" [puppet] - 10https://gerrit.wikimedia.org/r/1166215 (owner: 10Ssingh) [14:45:56] (03PS1) 10Volans: Upstream release v0.6.6 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166216 [14:46:05] (03CR) 10Volans: [C:03+2] Upstream release v0.6.6 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166216 (owner: 10Volans) [14:46:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T395241)', diff saved to https://phabricator.wikimedia.org/P78752 and previous config saved to /var/cache/conftool/dbconfig/20250703-144619-fceratto.json [14:47:00] (03Merged) 10jenkins-bot: Upstream release v0.6.6 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1166216 (owner: 10Volans) [14:48:15] !log repooling cp7006 
[14:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:56] (03CR) 10Volans: "Did a full pass, LGTM, beside the previous possible bug" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [14:49:49] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet [14:50:14] !log uploaded debmonitor-server,python3-debmonitor_0.6.6 to apt.wikimedia.org bookworm-wikimedia [14:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:36] (03PS30) 10Elukey: Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) [14:50:43] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1006.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002 [14:50:47] (03CR) 10Elukey: Add support for kubernetes (032 comments) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [14:50:53] (03PS4) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [14:51:16] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1005.eqiad.wmnet [14:53:02] (03PS17) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [14:55:05] (03CR) 10JHathaway: dhcp: add a UUID based DHCP config (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [14:55:49] (03CR) 10Volans: [C:03+1] "LGTM" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [14:55:54] !log cgoubert@cumin1003 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1005.eqiad.wmnet [14:56:56] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1007.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002 [15:00:05] jnuche and jeena: That opportune time for a Train log triage deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1500). [15:01:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P78753 and previous config saved to /var/cache/conftool/dbconfig/20250703-150126-fceratto.json [15:02:06] (03PS31) 10Elukey: Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) [15:02:16] (03CR) 10Elukey: Add support for kubernetes (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [15:04:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker-eqiad [15:04:37] (03CR) 10Elukey: [C:03+1] Remove puppetserver role frm puppetserver2003 for decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [15:04:50] !log jmm@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker-codfw [15:06:00] (03PS7) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 [15:06:31] (03PS5) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:38] (03CR) 10Volans: [C:03+1] "LGTM" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [15:07:44] (03CR) 10JHathaway: reimage: add support for using the host UUID for DHCP (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:09:10] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:10:42] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs5006.eqsin.wmnet with reason: katran migration [15:10:52] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs5006 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166134 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:14:26] (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:14:56] (03CR) 10Brouberol: [C:03+1] hdfs: set an-worker1176 and 1179 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1166170 (https://phabricator.wikimedia.org/T398027) (owner: 10Stevemunene) [15:16:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P78754 and previous config saved to /var/cache/conftool/dbconfig/20250703-151633-fceratto.json [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:44] !log vgutierrez@cumin1002 START - Cookbook 
sre.hosts.remove-downtime for lvs5006.eqsin.wmnet [15:21:44] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs5006.eqsin.wmnet [15:22:29] !log lvs5006 migrated to katran - T396561 [15:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [15:23:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:24:41] (03CR) 10Volans: [C:03+1] "LGTM, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [15:25:46] !log jmm@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker-codfw [15:28:29] (03PS1) 10Hnowlan: api-gateway: use latest build of ratelimit service in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804) [15:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:29:17] (03PS1) 10Ssingh: bird/anycast-hc: allow setting SupplementaryGroups for anycast-hc unit [puppet] - 10https://gerrit.wikimedia.org/r/1166222 (https://phabricator.wikimedia.org/T374619) [15:29:19] (03PS1) 10Ssingh: hiera: dnsbox: set supplementary_groups for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) [15:30:49] (03PS2) 10Ssingh: hiera: dnsbox: set supplementary_groups for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) [15:31:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P78755 and previous config saved to /var/cache/conftool/dbconfig/20250703-153141-fceratto.json [15:31:53] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6142/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [15:32:48] (03CR) 10Clément Goubert: [C:03+1] api-gateway: use latest build of ratelimit service in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [15:33:16] !log depooling cp7006 for testing [15:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:28] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7006.magru.wmnet with reason: testing [15:38:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye [15:38:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973083 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker117... 
[15:39:45] (03PS1) 10Ssingh: C:prometheus: dnsbox_service_state_exporter s/define/class [puppet] - 10https://gerrit.wikimedia.org/r/1166224 (https://phabricator.wikimedia.org/T374619) [15:40:12] (03CR) 10JHathaway: [C:03+2] dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [15:41:38] (03CR) 10Hnowlan: [C:03+2] api-gateway: use latest build of ratelimit service in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [15:42:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [15:42:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973107 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worke... 
[15:43:19] (03Merged) 10jenkins-bot: api-gateway: use latest build of ratelimit service in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166221 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [15:46:26] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:46:46] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:47:18] (03PS1) 10Ssingh: team-traffic: add dnsbox alert for service status mistmatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) [15:48:03] (03PS8) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 [15:48:11] (03PS2) 10Ssingh: team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) [15:48:15] (03CR) 10JHathaway: reimage: add support for using the host UUID for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:48:41] (03PS6) 10Abijeet Patro: CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) [15:49:24] (03CR) 10CI reject: [V:04-1] team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [15:52:02] (03PS3) 10Ssingh: team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) [15:52:40] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:52:54] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:54:20] (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 
10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:57:14] (03CR) 10Ssingh: "This is not really ready for review as I need to verify the metrics and the expression. But it's a good template for later so leaving it h" [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [15:57:32] (03CR) 10Volans: [C:03+1] "To be tested with a new spicerack release for the dhcp changes but LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [16:00:05] jhathaway and moritzm: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:02:08] jouncebot: nowandnext [16:02:09] For the next 0 hour(s) and 57 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1600) [16:02:09] In 0 hour(s) and 57 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700) [16:02:09] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700) [16:03:07] (03PS1) 10Michael Große: tests: skip test to allow updating CommunityConfigurationExample [extensions/CommunityConfiguration] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166226 [16:03:27] (03CR) 10Stevemunene: [C:03+2] hdfs: set an-worker1176 and 1179 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1166170 (https://phabricator.wikimedia.org/T398027) (owner: 10Stevemunene) [16:05:54] (03PS2) 10Michael Große: tests: skip test to allow updating CommunityConfigurationExample [extensions/CommunityConfiguration] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166226 (https://phabricator.wikimedia.org/T398624) [16:09:21] 
!log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7006.magru.wmnet [16:09:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7006.magru.wmnet [16:11:44] !log repooling cp7006 [16:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:07] (03PS1) 10Federico Ceratto: zarcillo: Update egress to idp.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810) [16:12:07] (03CR) 10Federico Ceratto: "A small update to the egress conf, already tested" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [16:13:08] 10ops-eqiad, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390064#10973219 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [16:14:45] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10973221 (10Jclark-ctr) a:03Jclark-ctr [16:15:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10973222 (10Jclark-ctr) 05Open→03Resolved [16:16:26] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983#10973223 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [16:17:03] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397852#10973225 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [16:25:18] (03PS1) 10Hnowlan: Revert "api-gateway: use latest build of ratelimit service in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166229 [16:27:48] (03CR) 10Hnowlan: [C:03+2] Revert "api-gateway: use latest build of ratelimit service in prod" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1166229 (owner: 10Hnowlan) [16:29:25] (03Merged) 10jenkins-bot: Revert "api-gateway: use latest build of ratelimit service in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166229 (owner: 10Hnowlan) [16:31:50] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [16:32:13] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [16:32:14] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [16:32:27] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:35:27] (03PS6) 10Vgutierrez: cache,haproxy: refactor captures to fix x-analytics logging take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [16:44:20] (03PS7) 10Vgutierrez: cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [16:47:04] (03CR) 10Abijeet Patro: CX: Add virtual-cx-shared DatabaseVirtualDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro) [16:47:21] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [16:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [16:52:38] (03PS8) 10Vgutierrez: cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) 
[17:00:05] bd808: #bothumor I ♥ Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1700) [17:01:04] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [17:01:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10973309 (10dcaro) >>! In T394333#10964951, @Andrew wrote: >>>! In T394333#10964303, @ayounsi wrote: >> @Andrew Would it be possible to use a single 25G up... [17:01:11] (03PS9) 10Vgutierrez: cache,haproxy: Remove http response captures [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) [17:01:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973310 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-... 
[17:03:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [17:07:29] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650 (10aranyap) 03NEW [17:08:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10973332 (10aranyap) [17:09:05] (03CR) 10Cathal Mooney: [C:03+2] PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [17:10:59] (03Merged) 10jenkins-bot: PuppetDB import: do not import vxlan, openvswitch type interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1166177 (https://phabricator.wikimedia.org/T398464) (owner: 10Cathal Mooney) [17:12:24] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:12:47] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [17:12:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:13:01] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [17:13:09] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [17:13:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [17:15:27] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
an-worker1176.eqiad.wmnet with reason: host reimage [17:17:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10973370 (10SLopes-WMF) As @aranyap's manager, I approve this request. [17:18:47] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1176.eqiad.wmnet with reason: host reimage [17:20:41] (03PS1) 10Volans: UI: improve table grouping by column [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166233 (https://phabricator.wikimedia.org/T397696) [17:22:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:23:31] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464#10973378 (10cmooney) 05Open→03Resolved a:03cmooney [17:24:38] (03CR) 10Volans: "Let me know what do you think" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166233 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [17:24:40] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@9088e59]: Synchronize artifacat for airflow_dags/analytics_test [17:24:47] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10973383 (10cmooney) FWIW I seen an interesting talk from the latest Nanog conference about "return loss" on shorter and faster links which can c... 
[17:24:55] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@9088e59]: Synchronize artifacat for airflow_dags/analytics_test (duration: 00m 15s) [17:25:41] !log joal@deploy1003 Started deploy [airflow-dags/analytics@9088e59]: Synchronize artifacts for airflow_dags/analytics [17:26:22] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@9088e59]: Synchronize artifacts for airflow_dags/analytics (duration: 00m 40s) [17:33:11] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1176.eqiad.wmnet with OS bullseye [17:33:21] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10973400 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-work... [17:43:35] (CR) A smart kitten: "The CI failure is T398624 (& maybe following 48b91a15 it won't happen any more?)" [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: C. Scott Ananian) [17:46:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:08] (CR) C. Scott Ananian: "recheck" [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166206 (https://phabricator.wikimedia.org/T398616) (owner: C. Scott Ananian) [17:59:40] (PS1) Bartosz Dziewoński: Use FallbackContentHandler for undeployed JsonConfig content handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1166236 (https://phabricator.wikimedia.org/T124748) [18:00:04] jnuche and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T1800). [18:01:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166236 (https://phabricator.wikimedia.org/T124748) (owner: Bartosz Dziewoński) [18:09:46] (PS1) Ssingh: C:bird::anycast_healthchecker: notify service on conf file change [puppet] - https://gerrit.wikimedia.org/r/1166238 [18:10:48] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6143/co" [puppet] - https://gerrit.wikimedia.org/r/1166238 (owner: Ssingh)