[00:03:09] 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11832147 (10Pppery) Still happening: https://en.wikipedia.org/wiki/User_talk:Pppery#File%3AHambonesMeditations.jpg [00:03:11] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [00:03:54] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [00:10:34] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [00:19:08] (03CR) 10Dzahn: [C:03+1] "ah yea, I wanted to do the --name thing anyways" [puppet] - 10https://gerrit.wikimedia.org/r/1272961 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:19:20] (03CR) 10Dzahn: [C:03+2] zuul: Name service containers and remove them when stopped [puppet] - 10https://gerrit.wikimedia.org/r/1272961 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:21:57] FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:22:36] !incidents [00:22:36] 7846 (UNACKED) ProbeDown sre (2001:df2:e500:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqsin) [00:22:36] 7845 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [00:22:37] 7844 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [00:22:37] 7843 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: 
cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [00:22:48] !ack 7846 [00:22:48] 7846 (ACKED) ProbeDown sre (2001:df2:e500:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqsin) [00:24:17] port 80? [00:24:20] that's interesting [00:26:57] RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:26:58] (03PS2) 10Dzahn: zuul: Provide tenant configuration [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:27:27] (03CR) 10Dzahn: "it wanted a license for the config file:) added one" [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:32:19] 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11832210 (10Ladsgroup) Can't say for sure but maybe that's because of a missing update in the new file schema? See line 192 on the new side https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1273030/1... 
[00:33:27] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:34:05] (03CR) 10Dzahn: [C:03+1] "smart-reconfigure is smart :)" [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:36:42] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1272970/8435/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:57] FIRING: CertAlmostExpired: Certificate for service ssw1-e1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#ssw1-e1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:47:13] (03CR) 10Dzahn: [V:03+1 C:03+2] "it had an issue on the very first puppet run but on second run it looked alright." 
[puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:47:24] (03CR) 10Dzahn: [V:03+1 C:03+2] "please do not worry about "ERROR zuul.GerritConnection.ssh: IsADirectoryError: [Errno 21] Is a directory: '/var/ssh/zuul'" I will fix th" [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:48:07] FIRING: ProbeDown: Service aqs1010-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1010-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:51:32] (03CR) 10Dzahn: [C:03+2] "restarted zuul-nodepool on zuul1001 and they all have the names now:" [puppet] - 10https://gerrit.wikimedia.org/r/1272961 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [00:53:07] FIRING: [2x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:54:57] FIRING: [2x] CertAlmostExpired: Certificate for service lsw1-e1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:57:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11832236 (10Jhancock.wm) a:03Jhancock.wm [00:57:56] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11832239 (10Jhancock.wm) a:03Jhancock.wm [01:03:52] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:04:52] RECOVERY - OSPF status on cr2-eqdfw is OK: 
OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:09:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1273066 [01:09:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1273066 (owner: 10TrainBranchBot) [01:15:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:16:14] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:21:22] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1273066 (owner: 10TrainBranchBot) [01:44:57] FIRING: [3x] CertAlmostExpired: Certificate for service lsw1-e1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:45:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T419635)', diff saved to https://phabricator.wikimedia.org/P91004 and previous config saved to /var/cache/conftool/dbconfig/20260417-014534-fceratto.json [01:45:39] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [01:55:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P91005 and previous config saved to /var/cache/conftool/dbconfig/20260417-015542-fceratto.json [01:59:09] 06SRE, 06Traffic: Investigate port 80 page in text@esams for Ipv6 - https://phabricator.wikimedia.org/T423667 (10jasmine_) 03NEW [02:01:16] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:03:36] 06SRE: Investigate port 80 page in text@esams for Ipv6 - 
https://phabricator.wikimedia.org/T423667#11832332 (10jasmine_) [02:05:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P91006 and previous config saved to /var/cache/conftool/dbconfig/20260417-020550-fceratto.json [02:07:42] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 25s) [02:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T419635)', diff saved to https://phabricator.wikimedia.org/P91007 and previous config saved to /var/cache/conftool/dbconfig/20260417-021558-fceratto.json [02:16:03] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [02:16:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1262.eqiad.wmnet with reason: Maintenance [02:16:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T419635)', diff saved to https://phabricator.wikimedia.org/P91008 and previous config saved to /var/cache/conftool/dbconfig/20260417-021624-fceratto.json [02:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:25:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[03:25:44] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ... [03:25:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [04:04:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:14:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T419635)', diff saved to https://phabricator.wikimedia.org/P91009 and previous config saved to /var/cache/conftool/dbconfig/20260417-041454-fceratto.json [04:14:59] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [04:25:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P91010 and previous config saved to /var/cache/conftool/dbconfig/20260417-042502-fceratto.json [04:33:42] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:35:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P91011 and previous config saved to 
/var/cache/conftool/dbconfig/20260417-043510-fceratto.json [04:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:45:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T419635)', diff saved to https://phabricator.wikimedia.org/P91012 and previous config saved to /var/cache/conftool/dbconfig/20260417-044518-fceratto.json [04:45:23] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [04:45:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1263.eqiad.wmnet with reason: Maintenance [04:45:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T419635)', diff saved to https://phabricator.wikimedia.org/P91013 and previous config saved to /var/cache/conftool/dbconfig/20260417-044543-fceratto.json [04:49:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:49:07] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on pc1011 - https://phabricator.wikimedia.org/T423630#11832436 (10Marostegui) p:05Triage→03Medium Any used disk that we can use for this host? Thanks! 
[04:53:07] FIRING: [2x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:08:27] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:12:19] (03PS1) 10Marostegui: db2158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1273316 [05:12:59] (03CR) 10Marostegui: [C:03+2] db2158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1273316 (owner: 10Marostegui) [05:13:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2158.codfw.wmnet with reason: Reimage to Trixie [05:13:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2158: Reimage to Trixie [05:13:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2158: Reimage to Trixie [05:16:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:37] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2158.codfw.wmnet with OS trixie [05:33:57] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2158.codfw.wmnet with reason: host reimage [05:37:36] (03PS1) 10Marostegui: Revert "db2158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1273343 [05:39:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db2158.codfw.wmnet with reason: host reimage [05:45:12] FIRING: [3x] CertAlmostExpired: Certificate for service lsw1-e1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:50:00] (03CR) 10Marostegui: [C:03+2] Revert "db2158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1273343 (owner: 10Marostegui) [05:56:00] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11832482 (10MoritzMuehlenhoff) [05:58:18] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11832487 (10LSobanski) @bd808 In addition to the good point raised by @A_smart_kitten above the general intent here is to reduce complexity. Leaving a dependen... [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260417T0600) [06:00:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2158.codfw.wmnet with OS trixie [06:03:22] (03PS3) 10Muehlenhoff: profile::zookeeper::firewall: Also allow passing a list of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1272766 [06:03:40] (03PS4) 10Muehlenhoff: profile::zookeeper::firewall: Also allow passing a list of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1272766 [06:04:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2158: repool after maintenance [06:05:55] (03PS1) 10Marostegui: db1201: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1273388 [06:06:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272766 (owner: 10Muehlenhoff) [06:06:40] (03CR) 10Marostegui: [C:03+2] db1201: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1273388 (owner: 10Marostegui) [06:06:53] !log marostegui@cumin1003 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1201.eqiad.wmnet with reason: Reimage to Trixie [06:06:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1201: Reimage to Trixie [06:07:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1201: Reimage to Trixie [06:08:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1201.eqiad.wmnet with OS trixie [06:12:45] (03PS5) 10Muehlenhoff: profile::zookeeper::firewall: Also allow passing a list of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1272766 [06:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:16:24] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11832500 (10Marostegui) [06:16:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272766 (owner: 10Muehlenhoff) [06:17:02] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11832504 (10Marostegui) [06:22:01] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage [06:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.66% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:24:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage [06:25:06] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T423672 (10Abijithkumar2025) 03NEW Closing this task as invalid due to missing information. [06:26:01] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T423672#11832518 (10Abijithkumar2025) 05Invalid→03Open [06:26:53] (03PS1) 10Marostegui: Revert "db1201: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1273408 [06:27:32] (03CR) 10Marostegui: [C:03+2] Revert "db1201: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1273408 (owner: 10Marostegui) [06:29:19] (03PS1) 10Muehlenhoff: Remove bast1003 from list of bastions [puppet] - 10https://gerrit.wikimedia.org/r/1273413 [06:40:07] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T423672#11832538 (10Aklapper) 05Open→03Invalid @Abijithkumar2025: Again: Please do not create empty tasks but fill out all fields. 
[06:40:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T419635)', diff saved to https://phabricator.wikimedia.org/P91019 and previous config saved to /var/cache/conftool/dbconfig/20260417-064023-fceratto.json [06:40:28] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [06:46:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1201.eqiad.wmnet with OS trixie [06:48:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1201: after reimage to trixie [06:49:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2158: repool after maintenance [06:50:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P91022 and previous config saved to /var/cache/conftool/dbconfig/20260417-065031-fceratto.json [06:52:01] (03PS1) 10Muehlenhoff: Switch Cloud VPS to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273441 (https://phabricator.wikimedia.org/T416707) [06:52:17] (03PS2) 10Muehlenhoff: Switch Cloud VPS to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273441 (https://phabricator.wikimedia.org/T416707) [06:56:50] (03PS1) 10Jelto: gerrit: migrate data gerrit1003 to /srv/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1273449 (https://phabricator.wikimedia.org/T333143) [06:58:32] (03PS1) 10Muehlenhoff: Switch the base images to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273453 (https://phabricator.wikimedia.org/T423622) [07:00:02] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1273449 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [07:00:05] Deploy window No deploys all day! 
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260417T0700) [07:00:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P91023 and previous config saved to /var/cache/conftool/dbconfig/20260417-070039-fceratto.json [07:04:33] (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1273449 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [07:05:07] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review, 06Release-Engineering-Team (Radar), 07User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11832582 (10Nemoralis) [07:10:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T419635)', diff saved to https://phabricator.wikimedia.org/P91025 and previous config saved to /var/cache/conftool/dbconfig/20260417-071048-fceratto.json [07:10:52] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:11:05] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:25:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11832624 (10MoritzMuehlenhoff) [07:25:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [07:25:59] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ... 
[07:25:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [07:32:15] (03CR) 10Elukey: [C:03+1] Remove Puppet 5 support from Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240877 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [07:33:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1201: after reimage to trixie [07:44:59] (03CR) 10Brouberol: [C:03+1] "Nothing to add on top of what otto said! Great work!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [07:53:29] (03PS1) 10Daniel Kinzler: rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 [07:53:47] (03PS1) 10Ayounsi: eqsin: add routed ganeti customer [homer/public] - 10https://gerrit.wikimedia.org/r/1273511 (https://phabricator.wikimedia.org/T421863) [07:54:09] (03CR) 10Elukey: ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [07:59:13] (03CR) 10Ayounsi: [C:03+2] eqsin: add routed ganeti customer [homer/public] - 10https://gerrit.wikimedia.org/r/1273511 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [07:59:46] (03CR) 10Muehlenhoff: "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1273511 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:00:50] (03Merged) 10jenkins-bot: eqsin: add routed ganeti customer [homer/public] - 10https://gerrit.wikimedia.org/r/1273511 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:09:10] FIRING: [2x] GanetiBGPDown: BGP session down between ganeti5007 and cr2-eqsin - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [08:09:25] (03CR) 10Majavah: [C:03+1] Switch 
Cloud VPS to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273441 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [08:10:53] (03CR) 10Majavah: Openstack: use debian.net repo rather than the wmf-hosted repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [08:14:10] FIRING: [4x] GanetiBGPDown: BGP session down between ganeti5007 and cr2-eqsin - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [08:17:22] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11832695 (10MoritzMuehlenhoff) This turned out to be also reproducible on classic Ganeti and after some painful debugging given the very early failure (approx 1.5 seconds after Linu... [08:18:02] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11832697 (10MoritzMuehlenhoff) [08:21:40] (03CR) 10Muehlenhoff: [C:03+2] Remove Puppet 5 support from Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240877 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:27:00] (03PS4) 10Atsuko: flink: Install flink in blubber-compatible venv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) [08:33:22] (03CR) 10DCausse: "I908966a32fc264c5084719337812ca303bac509c might make this patch useless since it removes space-discount" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [08:33:56] (03Abandoned) 10DCausse: search: add space-discount for wikidata custom prefix search profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) 
[08:37:58] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:37:58] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:38:07] PROBLEM - SSH on cp3072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:07] PROBLEM - SSH on cp3070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:07] PROBLEM - SSH on cp3068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:07] PROBLEM - SSH on cp3067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:07] PROBLEM - SSH on cp3073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:07] PROBLEM - SSH on cp3066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:09] PROBLEM - SSH on cp3069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:09] PROBLEM - SSH on cp3071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to 
HTTP requests on cp3071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:27] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:27] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:27] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:27] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:27] PROBLEM - Ensure 
traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:27] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:27] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:38:30] I cant reach any of our services (gerrit, grafana) from Germany [08:38:48] works from France [08:38:48] !ack [08:38:49] 7847 (ACKED) [4x] ProbeDown sre (probes/service esams) [08:38:52] esams has problems? [08:38:59] RECOVERY - SSH on cp3067 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:39:17] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3067 is OK: HTTP OK: HTTP/1.0 200 OK - 37121 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:39:45] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:39:49] Maybe what XioNoX mentioned about gtt? [08:39:52] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[08:39:57] RECOVERY - SSH on cp3073 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:40:03] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3073 is OK: HTTP OK: HTTP/1.1 200 OK - 50385 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:40:05] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3072 is OK: HTTP OK: HTTP/1.1 200 OK - 50371 bytes in 1.841 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:40:18] hm, having a tough time opening wikitech/grafana [08:40:29] from Ireland [08:40:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 4 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [08:40:40] Works from Spain [08:41:30] with tunnelencabulator it works again [08:41:46] Should we depool esams? [08:41:50] But it does work for me [08:41:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:42:02] I think that was a traffic spike: https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&from=now-3h&to=now&timezone=utc&refresh=5m&viewPanel=panel-2 [08:42:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[08:42:15] oof [08:42:27] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:42:27] Seems to be recovering? [08:42:35] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [08:42:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [08:42:58] FIRING: [4x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [08:43:07] PROBLEM - SSH on cp3067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:43:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:43:13] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:43:27] i suppose that makes sense [08:43:45] bjensen: what? [08:43:58] the by-country alert [08:44:00] (03CR) 10Ayounsi: [C:03+1] "Lgtm, I also see that there is a mention of bast1003 in `modules/wmflib/spec/functions/ipresolve_spec.rb`" [puppet] - 10https://gerrit.wikimedia.org/r/1273413 (owner: 10Muehlenhoff) [08:44:05] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[08:44:17] FIRING: [4x] JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:45:03] is depooling esams likely to intensify the problem? [08:45:05] RECOVERY - SSH on cp3072 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:45:05] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3072 is OK: HTTP OK: HTTP/1.1 200 OK - 50375 bytes in 2.672 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:45:17] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3072 is OK: HTTP OK: HTTP/1.0 200 OK - 37130 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:45:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:45:57] RECOVERY - SSH on cp3067 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:46:03] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3067 is OK: HTTP OK: HTTP/1.1 200 OK - 50372 bytes in 0.329 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:46:05] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[08:46:17] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3067 is OK: HTTP OK: HTTP/1.0 200 OK - 37121 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:46:53] (03PS1) 10Slyngshede: data: align config [puppet] - 10https://gerrit.wikimedia.org/r/1273658 [08:46:57] RECOVERY - SSH on cp3066 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:47:03] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3066 is OK: HTTP OK: HTTP/1.1 200 OK - 50378 bytes in 0.329 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:47:17] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3066 is OK: HTTP OK: HTTP/1.0 200 OK - 37129 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:47:57] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:47:58] FIRING: [2x] NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [08:47:59] FIRING: [6x] NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [08:48:06] !ack [08:48:07] All incidents are already acked. 
[08:48:26] !incidents [08:48:26] 7847 (ACKED) [4x] ProbeDown sre (probes/service esams) [08:48:26] 7848 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [08:48:26] 7846 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqsin) [08:48:27] 7845 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [08:48:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:48:27] 7844 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [08:48:27] 7843 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:49:17] FIRING: [4x] JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:50:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 3 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [08:50:42] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[08:51:16] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool esams [reason: no reason specified, no task ID specified] [08:51:19] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool esams [reason: no reason specified, no task ID specified] [08:51:42] !log depool esams due to connectivity issues [08:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:52:57] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:53:07] FIRING: [4x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:53:25] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3070 is OK: HTTP OK: HTTP/1.0 200 OK - 37119 bytes in 7.573 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:53:50] (03PS2) 10Dpogorzelski: knative-serving: update chart to 1.21.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 [08:53:57] RECOVERY - SSH on cp3070 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:54:03] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3070 is OK: HTTP OK: HTTP/1.1 200 OK - 50370 bytes in 0.329 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server 
[08:54:17] RESOLVED: [4x] JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:55:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 3 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [08:56:13] (03PS1) 10Dpogorzelski: kserve-resources: fix securityContext propagation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) [08:57:54] (03CR) 10CI reject: [V:04-1] kserve-resources: fix securityContext propagation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) (owner: 10Dpogorzelski) [08:57:58] RESOLVED: [2x] NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [08:57:59] RESOLVED: [6x] NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [09:05:09] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3069 is OK: HTTP OK: HTTP/1.1 200 OK - 50375 bytes in 6.214 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:05:17] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3069 
is OK: HTTP OK: HTTP/1.0 200 OK - 37126 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:05:59] RECOVERY - SSH on cp3069 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:07:50] (03CR) 10Elukey: "Really sorry for all the extra work that I caused with the suggestion of removing this, at this point we need to just add the workloadType" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) (owner: 10Dpogorzelski) [09:08:00] (03CR) 10Atsuko: flink: Install flink in blubber-compatible venv (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [09:08:42] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 3 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [09:12:41] (03CR) 10Elukey: "Even if it makes no sense, the current CRD allows extra fields, it doesn't really validate them. So it should be ok now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) (owner: 10Dpogorzelski) [09:13:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1273658 (owner: 10Slyngshede) [09:13:46] (03PS1) 10Jcrespo: backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) [09:14:01] (03PS2) 10Jcrespo: backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) [09:14:10] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo) [09:14:12] !incidents [09:14:12] 7847 (ACKED) [4x] ProbeDown sre (probes/service esams) [09:14:12] 7848 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [09:14:13] 7846 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqsin) [09:14:13] 7845 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [09:14:13] 7844 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [09:16:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:51] (03CR) 10Brouberol: [C:03+1] flink: Install flink in blubber-compatible venv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [09:21:00] (03PS1) 10Jelto: gerrit: migrate gerrit2003 data to /srv/gerrit [puppet] - 
10https://gerrit.wikimedia.org/r/1273683 (https://phabricator.wikimedia.org/T333143) [09:21:27] (03CR) 10Jelto: [C:04-1] "should not be merged yet" [puppet] - 10https://gerrit.wikimedia.org/r/1273683 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [09:21:34] (03CR) 10Atsuko: [C:03+2] flink: Install flink in blubber-compatible venv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [09:25:04] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11832785 (10brouberol) a:03brouberol [09:25:34] (03CR) 10Atsuko: [V:03+2 C:03+2] flink: Install flink in blubber-compatible venv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [09:26:13] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11832790 (10brouberol) @elukey I'd appreciated guidance as to how to build the new `docker-report` deb package, and deploy it. T... [09:29:35] (03CR) 10Jcrespo: "Please have a look and send amends/comments." 
[puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo) [09:32:47] PROBLEM - Host cp3068 is DOWN: PING CRITICAL - Packet loss = 100% [09:34:35] RECOVERY - Host cp3068 is UP: PING OK - Packet loss = 0%, RTA = 80.14 ms [09:34:37] PROBLEM - haproxy process on cp3068 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [09:34:57] RECOVERY - SSH on cp3068 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:35:07] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3068 is OK: HTTP OK: HTTP/1.1 200 OK - 48162 bytes in 0.324 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:35:17] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3068 is OK: HTTP OK: HTTP/1.0 200 OK - 36142 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:35:20] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e3-codfw [09:35:21] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3068 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:35:21] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3068 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:35:21] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3068 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:35:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e3-codfw [09:35:34] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e1-codfw [09:35:41] !log 
ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e1-codfw [09:35:49] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device ssw1-e1-codfw [09:35:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-e1-codfw [09:36:04] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device ssw1-f1-codfw [09:36:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-f1-codfw [09:36:20] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f3-codfw [09:36:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f3-codfw [09:38:21] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3068 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 25 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:38:21] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3068 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-07-06 20:52:29 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:38:21] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3068 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-06-06 06:58:50 +0000 (expires in 49 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:38:38] (03PS2) 10Jcrespo: dbbackups: Backup only regularly clusters 32 & 33, the read-write ones [puppet] - 10https://gerrit.wikimedia.org/r/1271730 (https://phabricator.wikimedia.org/T421729) [09:38:38] (03PS3) 10Jcrespo: backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) [09:38:38] 
(03PS1) 10Jcrespo: backup: Remove references to ips of hosts about to be decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1273699 (https://phabricator.wikimedia.org/T422851) [09:38:39] RECOVERY - haproxy process on cp3068 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [09:38:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11832838 (10Martyn.ranyard) I approve this, but I think @Scott_French or someone else on that team still needs to add you to the gro... [09:39:13] (03PS2) 10Jcrespo: backup: Remove references to ips of hosts about to be decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1273699 (https://phabricator.wikimedia.org/T422851) [09:39:26] (03PS4) 10Jcrespo: backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) [09:39:57] RESOLVED: [3x] CertAlmostExpired: Certificate for service lsw1-e1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:43:59] PROBLEM - Host cp3071 is DOWN: PING CRITICAL - Packet loss = 100% [09:44:34] !log initialise eqsin02 Ganeti cluster T421863 [09:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:37] RECOVERY - Host cp3071 is UP: PING OK - Packet loss = 0%, RTA = 80.19 ms [09:44:37] PROBLEM - haproxy process on cp3071 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [09:44:37] PROBLEM - statsv Varnishkafka log producer on cp3071 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:44:38] T421863: Migrating eqsin to routed Ganeti - 
https://phabricator.wikimedia.org/T421863 [09:44:59] RECOVERY - SSH on cp3071 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:45:05] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3071 is OK: HTTP OK: HTTP/1.1 200 OK - 48144 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:45:17] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3071 is OK: HTTP OK: HTTP/1.0 200 OK - 36141 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:45:29] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3071 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:45:29] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3071 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:45:29] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3071 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:45:37] RECOVERY - statsv Varnishkafka log producer on cp3071 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:47:57] RESOLVED: [2x] ProbeDown: Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:48:05] nice [09:48:07] FIRING: [4x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:48:29] RECOVERY - HAProxy 
HTTPS wikiworkshop.org ECDSA on cp3071 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 25 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:48:29] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3071 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-06-06 06:58:50 +0000 (expires in 49 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:48:29] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3071 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-07-06 20:52:29 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:48:39] RECOVERY - haproxy process on cp3071 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [09:48:42] !incidents [09:48:43] 7847 (RESOLVED) [4x] ProbeDown sre (probes/service esams) [09:48:43] 7848 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [09:48:43] 7846 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqsin) [09:48:43] 7845 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [09:48:43] 7844 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [09:50:02] (03PS1) 10Muehlenhoff: Netbox: Add eqsin02 to Ganeti sync [puppet] - 10https://gerrit.wikimedia.org/r/1273700 (https://phabricator.wikimedia.org/T421863) [09:50:03] (03CR) 10Jcrespo: "I believe you were the ones to introduce these real ips, making sure you are ok with this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1273699 (https://phabricator.wikimedia.org/T422851) (owner: 10Jcrespo) [09:52:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:53:51] !log marostegui@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: no reason specified, no task ID specified] [09:53:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool esams [reason: no reason specified, no task ID specified] [09:54:05] !log pool esams [09:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:06] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts ms-backup1001.eqiad.wmnet [09:56:36] (03CR) 10Ayounsi: [C:03+1] Netbox: Add eqsin02 to Ganeti sync [puppet] - 10https://gerrit.wikimedia.org/r/1273700 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [09:58:57] (03CR) 10Muehlenhoff: [C:03+2] Netbox: Add eqsin02 to Ganeti sync [puppet] - 10https://gerrit.wikimedia.org/r/1273700 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [10:00:56] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [10:02:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11832942 (10brouberol) @Jclark-ctr Feel free to replace the drive whenever convenient for you. Thank you! [10:02:19] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10Toolforge: Adjust WMCS Gitlab CI/CD repo to stop using mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423596#11832943 (10A_smart_kitten) (per T423596#11829501 & as the files seem potentially toolforge-related FWICS; feel free to reta... 
[10:03:19] (03PS1) 10Jcrespo: backup: Remove old references to ms-backup1001 & ms-backup1002 [puppet] - 10https://gerrit.wikimedia.org/r/1273704 (https://phabricator.wikimedia.org/T422851) [10:03:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [10:03:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:03:55] (03PS6) 10Jelto: gerrit: migrate gerrit_site away from root partition [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [10:04:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91029 and previous config saved to /var/cache/conftool/dbconfig/20260417-100401-fceratto.json [10:05:56] (03CR) 10Jelto: [C:04-1] "@abran@wikimedia.org I rebased the change, maybe you can check if it also makes sense to you? 
-1 for now, should be merged after the migra" [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [10:06:39] jynus@cumin1003 decommission (PID 2051147) is awaiting input [10:11:16] thank you mr logmsgbot [10:11:50] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-backup1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [10:12:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91030 and previous config saved to /var/cache/conftool/dbconfig/20260417-101233-fceratto.json [10:12:47] there is a new ganeti_cluster: codfw_test for netbox [10:12:58] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-backup1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [10:12:58] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:12:59] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-backup1001.eqiad.wmnet [10:13:56] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts ms-backup1002.eqiad.wmnet [10:14:25] FIRING: SystemdUnitFailed: netbox_ganeti_eqsin02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:15:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11832984 (10MoritzMuehlenhoff) [10:20:08] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [10:22:41] !log fceratto@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P91031 and previous config saved to /var/cache/conftool/dbconfig/20260417-102241-fceratto.json [10:23:17] (03PS2) 10Muehlenhoff: Remove bast1003 from list of bastions [puppet] - 10https://gerrit.wikimedia.org/r/1273413 [10:25:47] jynus@cumin1003 decommission (PID 2069153) is awaiting input [10:31:08] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-backup1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [10:32:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P91032 and previous config saved to /var/cache/conftool/dbconfig/20260417-103249-fceratto.json [10:34:13] jynus@cumin1003 decommission (PID 2069153) is awaiting input [10:36:13] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the configured directory - https://phabricator.wikimedia.org/T423689 (10jcrespo) 03NEW [10:36:46] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690 (10MatthewVernon) 03NEW [10:36:51] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11833110 (10MatthewVernon) p:05Triage→03High [10:36:53] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the configured directory - https://phabricator.wikimedia.org/T423689#11833112 (10jcrespo) Let me know if this box or any other requires inve... 
[10:37:16] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-backup1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [10:37:16] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:37:17] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-backup1002.eqiad.wmnet [10:37:24] (03CR) 10Clément Goubert: [C:03+1] backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo) [10:37:48] (03CR) 10Jcrespo: [C:03+2] backup: Remove old references to ms-backup1001 & ms-backup1002 [puppet] - 10https://gerrit.wikimedia.org/r/1273704 (https://phabricator.wikimedia.org/T422851) (owner: 10Jcrespo) [10:38:08] (03PS2) 10Clément Goubert: gateway-check: Add matchers for liftwing and recommendation-api-ng [puppet] - 10https://gerrit.wikimedia.org/r/1271804 (https://phabricator.wikimedia.org/T422804) [10:38:56] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833117 (10jcrespo) [10:40:45] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11833133 (10Blake) That makes a lot of sense; I've updated the docs with those edits, thanks! I've also lin... 
[10:41:48] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission ms-backup1001 & ms-backup1002 - https://phabricator.wikimedia.org/T422851#11833134 (10jcrespo) a:05jcrespo→03None [10:42:01] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission ms-backup1001 & ms-backup1002 - https://phabricator.wikimedia.org/T422851#11833138 (10jcrespo) This is ready for dc ops. [10:42:27] (03CR) 10Clément Goubert: [C:03+1] opensearch on k8s: Add semantic-search and ipoid to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [10:42:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91033 and previous config saved to /var/cache/conftool/dbconfig/20260417-104257-fceratto.json [10:43:01] (03CR) 10Clément Goubert: [C:03+1] Switch the base images to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273453 (https://phabricator.wikimedia.org/T423622) (owner: 10Muehlenhoff) [10:43:06] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts ms-backup2001.codfw.wmnet [10:43:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:43:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T419961)', diff saved to https://phabricator.wikimedia.org/P91034 and previous config saved to /var/cache/conftool/dbconfig/20260417-104327-fceratto.json [10:43:32] (03CR) 10Clément Goubert: [C:03+1] opensearch on k8s: Activate semantic-search and ipoid in services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1272909 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [10:44:49] (03PS2) 10Effie Mouzeli: mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) [10:44:59] (03CR) 10Muehlenhoff: [C:03+2] 
Switch the base images to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273453 (https://phabricator.wikimedia.org/T423622) (owner: 10Muehlenhoff) [10:45:41] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:47:04] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11833147 (10MoritzMuehlenhoff) >>! In T423622#11830895, @thcipriani wrote: > Updated task description to clarify, yes, the S... [10:47:15] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11833149 (10MoritzMuehlenhoff) p:05Triage→03High [10:48:36] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [10:50:21] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:50:21] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:51:18] (03PS1) 10Jcrespo: bacula: Reenable copy jobs to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/1273738 (https://phabricator.wikimedia.org/T313582) [10:51:31] (03PS2) 10Jcrespo: bacula: Reenable copy jobs to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/1273738 (https://phabricator.wikimedia.org/T313582) [10:52:01] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:52:27] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273738 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [10:52:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T419961)', diff saved to https://phabricator.wikimedia.org/P91035 and previous config saved to /var/cache/conftool/dbconfig/20260417-105234-fceratto.json [10:53:28] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-backup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [10:54:42] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-backup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [10:54:42] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:54:42] (03PS1) 10Effie Mouzeli: (WIP) update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 [10:54:43] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-backup2001.codfw.wmnet [10:55:21] !log jynus@cumin1003 START - Cookbook 
sre.hosts.decommission for hosts ms-backup2002.codfw.wmnet [10:55:40] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:57:04] (03PS3) 10Jcrespo: bacula: Reenable copy jobs to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/1273738 (https://phabricator.wikimedia.org/T313582) [10:57:10] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273738 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260417T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: May I have your attention please! GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260417T1100) [11:02:05] (03CR) 10Jcrespo: [C:03+2] bacula: Reenable copy jobs to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/1273738 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [11:02:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P91036 and previous config saved to /var/cache/conftool/dbconfig/20260417-110242-fceratto.json [11:03:00] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [11:08:18] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-backup2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [11:11:24] jynus@cumin1003 decommission (PID 2112382) is awaiting input [11:11:31] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: ms-backup2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [11:11:31] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:11:32] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-backup2002.codfw.wmnet [11:12:05] (03PS1) 10Jcrespo: backup: Fix wrong storage config for offsite [puppet] - 10https://gerrit.wikimedia.org/r/1273740 (https://phabricator.wikimedia.org/T313582) [11:12:32] (03PS2) 10Jcrespo: backup: Fix wrong storage config for offsite [puppet] - 10https://gerrit.wikimedia.org/r/1273740 (https://phabricator.wikimedia.org/T313582) [11:12:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P91037 and previous config saved to /var/cache/conftool/dbconfig/20260417-111250-fceratto.json [11:14:38] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11833215 (10MoritzMuehlenhoff) [11:15:27] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273740 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [11:15:52] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (backup1013), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:17:40] (03CR) 10Jcrespo: [C:03+2] backup: Fix wrong storage config for offsite [puppet] - 10https://gerrit.wikimedia.org/r/1273740 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [11:19:35] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Bump memory requirements for VMs to 2G [cookbooks] - 10https://gerrit.wikimedia.org/r/1273743 (https://phabricator.wikimedia.org/T422596) [11:23:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T419961)', diff saved to Unable to send diff to phaste and previous config saved to 
/var/cache/conftool/dbconfig/20260417-112259-fceratto.json [11:23:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:23:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T419961)', diff saved to https://phabricator.wikimedia.org/P91039 and previous config saved to /var/cache/conftool/dbconfig/20260417-112333-fceratto.json [11:25:10] (03PS1) 10Ayounsi: eqsin routed ganeti: use private IPs instead of loopbacks [puppet] - 10https://gerrit.wikimedia.org/r/1273744 (https://phabricator.wikimedia.org/T421863) [11:25:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [11:25:59] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ... 
[11:25:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [11:30:56] RECOVERY - MariaDB Replica Lag: s8 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:31:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ms-backup1001 & ms-backup1002 - https://phabricator.wikimedia.org/T422851#11833242 (10Jclark-ctr) a:03Jclark-ctr [11:32:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T419961)', diff saved to https://phabricator.wikimedia.org/P91040 and previous config saved to /var/cache/conftool/dbconfig/20260417-113201-fceratto.json [11:32:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1273744 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [11:32:48] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 3 others: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11833255 (10Jclark-ctr) [11:32:50] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 3 others: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11833256 (10Jclark-ctr) 05Open→03Resolved [11:34:58] (03CR) 10Ayounsi: [C:03+2] eqsin routed ganeti: use private IPs instead of loopbacks [puppet] - 10https://gerrit.wikimedia.org/r/1273744 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [11:35:30] PROBLEM - Host ganeti1031 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:32] PROBLEM - Host netboxdb1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:34] PROBLEM - Host logstash1023 is DOWN: PING CRITICAL - Packet loss = 100% [11:36:04] PROBLEM - Host apt1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:37:00] RECOVERY - Host ganeti1031 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:37:30] (03PS1) 10Jcrespo: mediabackup: Remove last references to 
ms-backup2001 & ms-backup2002 [puppet] - 10https://gerrit.wikimedia.org/r/1273745 (https://phabricator.wikimedia.org/T422852) [11:38:07] FIRING: [5x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:33] (03CR) 10Ayounsi: [C:03+1] sre.ganeti.makevm: Bump memory requirements for VMs to 2G [cookbooks] - 10https://gerrit.wikimedia.org/r/1273743 (https://phabricator.wikimedia.org/T422596) (owner: 10Muehlenhoff) [11:38:37] (03CR) 10Jcrespo: [C:03+2] mediabackup: Remove last references to ms-backup2001 & ms-backup2002 [puppet] - 10https://gerrit.wikimedia.org/r/1273745 (https://phabricator.wikimedia.org/T422852) (owner: 10Jcrespo) [11:39:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11833265 (10ayounsi) [11:39:17] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_global in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:40:32] RECOVERY - Host apt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [11:40:34] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission ms-backup2001 & ms-backup2002 - https://phabricator.wikimedia.org/T422852#11833266 (10jcrespo) a:05jcrespo→03None [11:40:36] RECOVERY - Host logstash1023 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [11:40:36] RECOVERY - Host netboxdb1003 is UP: PING WARNING - Packet loss = 50%, RTA = 0.84 ms [11:40:47] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission ms-backup2001 & ms-backup2002 - https://phabricator.wikimedia.org/T422852#11833270 (10jcrespo) This is ready for dc ops. 
[11:41:53] (03CR) 10Klausman: [C:03+1] "Just one clarification question, LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 (owner: 10Dpogorzelski) [11:42:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P91041 and previous config saved to /var/cache/conftool/dbconfig/20260417-114210-fceratto.json [11:42:20] (03CR) 10Klausman: [C:03+1] kserve-resources: fix securityContext propagation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) (owner: 10Dpogorzelski) [11:43:07] FIRING: [5x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job netbox_global in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:44:25] FIRING: [7x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:44] moritzm: any idea why ganeti1031 rebooted? 
:) ganeti1031:~$ uptime -> 11:45:17 up 8 min, 2 users, load average: 5.94, 4.17, 1.94 [11:45:55] it briefly took down Netbox's DB [11:47:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ms-backup1001 & ms-backup1002 - https://phabricator.wikimedia.org/T422851#11833279 (10Jclark-ctr) [11:47:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ms-backup1001 & ms-backup1002 - https://phabricator.wikimedia.org/T422851#11833280 (10Jclark-ctr) 05Open→03Resolved [11:47:51] possibly some hardware issue, can you connect to the mgmt? it stalls for me [11:48:06] moritzm: I can ssh fine to the host [11:48:17] the host, yes, but also the mgmt? [11:48:24] dunno, haven't tried :) [11:48:37] can you check? otherwise I'll make a dc ops task [11:49:21] the server is quite old, it'll be refreshed sometime next FY [11:49:25] FIRING: [8x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:39] moritzm: the server's mgmt doesn't reply to pings [11:49:44] something is fried [11:50:00] (03CR) 10Muehlenhoff: [C:03+2] sre.ganeti.makevm: Bump memory requirements for VMs to 2G [cookbooks] - 10https://gerrit.wikimedia.org/r/1273743 (https://phabricator.wikimedia.org/T422596) (owner: 10Muehlenhoff) [11:50:21] k, I'll open a DC ops task for this so that they can check it Monday [11:52:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P91042 and previous config saved to /var/cache/conftool/dbconfig/20260417-115218-fceratto.json [11:53:00] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:53:27] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:54:25] FIRING: [8x] 
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:51] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:55:19] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:56:28] 10ops-eqiad, 06DC-Ops: Unreachable mgmt on ganeti1031 - https://phabricator.wikimedia.org/T423697 (10MoritzMuehlenhoff) 03NEW [11:58:47] (03CR) 10Elukey: [C:03+1] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [11:59:25] FIRING: [8x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:17] 10ops-eqiad, 06DC-Ops: Unreachable mgmt on ganeti1031 - https://phabricator.wikimedia.org/T423697#11833311 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced power cable. The latch tab on the RJ45 end of the management cable was broken, and the cable had fallen out. 
[12:02:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T419961)', diff saved to https://phabricator.wikimedia.org/P91043 and previous config saved to /var/cache/conftool/dbconfig/20260417-120226-fceratto.json [12:02:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:02:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T419961)', diff saved to https://phabricator.wikimedia.org/P91044 and previous config saved to /var/cache/conftool/dbconfig/20260417-120255-fceratto.json [12:06:29] (03CR) 10Elukey: [C:03+1] "Really nice! I think it is worth to add a comment about the kubernetes min version to the README for future references." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 (owner: 10Dpogorzelski) [12:06:46] (03PS1) 10Tiziano Fogli: thanos/compact: avoid constant Puppet changes [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911) [12:09:25] FIRING: [7x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T419961)', diff saved to https://phabricator.wikimedia.org/P91045 and previous config saved to /var/cache/conftool/dbconfig/20260417-121056-fceratto.json [12:12:00] (03CR) 10Tiziano Fogli: [C:03+1] puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [12:12:11] (03CR) 10Tiziano Fogli: [C:03+1] pyrra: remove pyrra/slo/slos dns entries [dns] - 10https://gerrit.wikimedia.org/r/1270995 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [12:12:23] 
(03CR) 10Tiziano Fogli: [C:03+1] pyrra: remove configuration for web interface [puppet] - 10https://gerrit.wikimedia.org/r/1270992 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [12:12:35] (03CR) 10Tiziano Fogli: [C:03+1] pyrra: ensure absent on package and services [puppet] - 10https://gerrit.wikimedia.org/r/1270974 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [12:13:44] (03CR) 10Tiziano Fogli: [C:03+1] Avoid false positive alerts after Ganeti master failover [puppet] - 10https://gerrit.wikimedia.org/r/1272701 (owner: 10Muehlenhoff) [12:14:25] RESOLVED: [3x] SystemdUnitFailed: netbox_ganeti_eqsin02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:09] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11833335 (10elukey) I already had the docker-report repo checked out in my home dir on build2002, so I pulled your changes and r... [12:17:13] (03CR) 10A smart kitten: Enwikinews: disable lingering FlaggedRevs template processing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271839 (https://phabricator.wikimedia.org/T423512) (owner: 10Pppery) [12:21:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P91046 and previous config saved to /var/cache/conftool/dbconfig/20260417-122104-fceratto.json [12:25:31] (03CR) 10Tiziano Fogli: "The metric mediawiki_http_requests_duration_bucket comes from the ops instance. 
Ideally, if possible, the best place for a recording rule " [puppet] - 10https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [12:25:56] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11833361 (10elukey) The docker-report run now works, but it happened the same also tonight EU time (no growthbook error reported... [12:26:43] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for clouddb1019.mgmt:22 - https://phabricator.wikimedia.org/T423387#11833362 (10Jclark-ctr) 05Open→03Resolved [12:29:04] PROBLEM - MariaDB Replica Lag: m1 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:31:12] jynus: ^ a big delete happened in bacula or something? looks like it on the replica, nothing important, but just checking [12:31:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P91047 and previous config saved to /var/cache/conftool/dbconfig/20260417-123111-fceratto.json [12:31:16] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [12:32:05] marostegui: yeah, I was doing some cleanup [12:32:11] jynus: cool, thanks! 
noted [12:32:12] but didn't expect it was so large [12:32:16] nah no issue [12:32:26] it won't p4ge or anything like that [12:33:01] I think there was some corruption due to old records being stale [12:33:04] and I am fixing that [12:33:45] let me know if it goes too bad [12:36:04] RECOVERY - MariaDB Replica Lag: m1 on db2160 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:36:53] jynus: ^ wohoo [12:36:59] (03PS1) 10Effie Mouzeli: mw-parsoid: bump php-fpm worker number [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273766 [12:37:46] (03CR) 10Jgiannelos: [C:03+1] mw-parsoid: bump php-fpm worker number [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273766 (owner: 10Effie Mouzeli) [12:38:19] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid: bump php-fpm worker number [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273766 (owner: 10Effie Mouzeli) [12:40:28] (03Merged) 10jenkins-bot: mw-parsoid: bump php-fpm worker number [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273766 (owner: 10Effie Mouzeli) [12:41:15] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:41:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T419961)', diff saved to https://phabricator.wikimedia.org/P91048 and previous config saved to /var/cache/conftool/dbconfig/20260417-124120-fceratto.json [12:41:41] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:41:42] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [12:41:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T419961)', diff saved to https://phabricator.wikimedia.org/P91049 and previous config saved to /var/cache/conftool/dbconfig/20260417-124149-fceratto.json [12:43:07] FIRING: [4x] ProbeDown: Service 
aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:45:32] (03PS2) 10Ottomata: flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) [12:45:47] (03CR) 10Ottomata: flink-app - default to setting metrics.internal.query-service.port (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [12:46:45] (03PS4) 10Arnaudb: gerrit: update sync-instances cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) [12:47:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:50:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T419961)', diff saved to https://phabricator.wikimedia.org/P91050 and previous config saved to /var/cache/conftool/dbconfig/20260417-125009-fceratto.json [12:50:55] (03PS3) 10A smart kitten: enwikinews: Move override for $wgFlaggedRevsHandleIncludes to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273763 (https://phabricator.wikimedia.org/T423512) [12:51:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] enwikinews: Move override for $wgFlaggedRevsHandleIncludes to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273763 (https://phabricator.wikimedia.org/T423512) (owner: 10A smart kitten) [12:51:58] (03CR) 10A smart kitten: "Thanks again for your investigations @lucas.werkmeister@wikimedia.de :)" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1273763 (https://phabricator.wikimedia.org/T423512) (owner: 10A smart kitten) [12:52:43] (03PS1) 10Marostegui: mariadb: Productionize clouddb1032 [puppet] - 10https://gerrit.wikimedia.org/r/1273769 (https://phabricator.wikimedia.org/T409557) [12:52:49] dhinus: ^ [12:53:11] (03CR) 10Marostegui: "Initial patch to be able to start the cloning" [puppet] - 10https://gerrit.wikimedia.org/r/1273769 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [12:53:21] (03CR) 10A smart kitten: Enwikinews: disable lingering FlaggedRevs template processing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271839 (https://phabricator.wikimedia.org/T423512) (owner: 10Pppery) [12:53:37] (03PS3) 10Dpogorzelski: knative-serving: update chart to 1.21.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 [12:54:22] (03CR) 10Dpogorzelski: knative-serving: update chart to 1.21.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 (owner: 10Dpogorzelski) [12:54:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [12:55:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P91051 and previous config saved to /var/cache/conftool/dbconfig/20260417-125501-fceratto.json [12:55:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:55:22] marostegui: wait, clouddb1032 has not arrived yet, has it? [12:55:49] dhinus: you are right, I have no idea why I did 32 instead of 24 [12:55:51] let me amend [12:55:56] ah ok :D [12:56:36] btw when is the x4 split going to happen in prod? 
[12:56:48] dhinus: I am not sure yet [12:56:49] Amir1: ^ [12:57:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P91052 and previous config saved to /var/cache/conftool/dbconfig/20260417-125714-fceratto.json [12:57:21] let's move to -data-persistence :) [12:58:21] (03CR) 10Kamila Součková: "@jmeybohm@wikimedia.org it passed! I even ran it a 2nd time because I couldn't believe it!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [12:58:51] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11833429 (10brouberol) I had modified the script on disk with `PYTHONPATH=''` to run the whole command, to ensure our change wou... [12:59:17] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11833432 (10brouberol) Thank you @elukey for your assistance in the review, build and release process! I think we can now close... [13:00:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P91053 and previous config saved to /var/cache/conftool/dbconfig/20260417-130018-fceratto.json [13:00:37] !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp1001.eqiad.wmnet [13:01:22] fabfur, tappof: I would like to do an emergency deploy for https://gerrit.wikimedia.org/r/1273763 (unbreak some enwikinews things, see https://phabricator.wikimedia.org/T423512#11827523 for impact); I’m not sure if it qualifies as an emergency, but as I already tested the fix on mw-experimental I think the risk should also be fairly low. 
Are SRE [13:01:22] okay with deployment? (cc thcipriani, dduvall, A_smart_kitten) [13:01:23] (03Abandoned) 10Marostegui: mariadb: Productionize clouddb1032 [puppet] - 10https://gerrit.wikimedia.org/r/1273769 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [13:01:35] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11833440 (10elukey) 05In progress→03Resolved Thank you for the code fixes! [13:01:47] (also, A_smart_kitten, do you know how to test the change via the web, beyond checking the value of the $wg variable? ^^) [13:02:43] (I *think* I’m actually seeing latest template versions on the main page, but I might be missing something) [13:02:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:03:08] Lucas_WMDE: not currently, checking the variable on the shell side of things would currently be my idea for how to test this if a deployment happens. 
i'll read some discussions to see if i can find something that isn't currently working that might be testable [13:03:09] (03CR) 10Kamila Součková: [C:03+2] Revert "shellbox: Setup shellbox-icu72" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [13:03:31] the edit summary at https://en.wikinews.org/w/index.php?title=Template:Lead_article_1&action=history looks like the main page issue may have been worked around [13:03:36] i believe the main page issue specifically was fixed by a deletion & recreation of the templates (or something like that), xref https://phabricator.wikimedia.org/T423512#11827469 [13:03:48] yeah [13:05:10] (03Merged) 10jenkins-bot: Revert "shellbox: Setup shellbox-icu72" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [13:06:25] RESOLVED: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:02] !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp1001.eqiad.wmnet [13:07:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P91054 and previous config saved to /var/cache/conftool/dbconfig/20260417-130722-fceratto.json [13:07:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:08:16] Lucas_WMDE: okay, may(TM) be testable with 
https://en.wikinews.org/wiki/Tour_de_France:_Alberto_Contador_wins_the_grand_tour -- currently all entries in that infobox display a date in february 2025, but https://en.wikinews.org/wiki/Template:Tour_de_France_2007 seems to have since been updated with new dates [13:08:34] (unless that is as a result of something else. in which case, who knows) [13:08:42] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:20] (03CR) 10Arnaudb: [C:03+1] "thanks for the rebase! that makes total sense." [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [13:09:41] A_smart_kitten: sounds good, let’s test that in mw-experimental [13:09:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11833474 (10Jclark-ctr) @brouberol Drive has been Swapped Thanks! [13:10:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P91055 and previous config saved to /var/cache/conftool/dbconfig/20260417-131026-fceratto.json [13:10:41] mw-experimental-eqiad updated [13:11:15] hm, still shows 2025-02-17 dates even after a purge with XWD… [13:11:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on pc1011 - https://phabricator.wikimedia.org/T423630#11833478 (10Jclark-ctr) a:03Jclark-ctr Solid State Disk 0:1:6 Removed 6 1787.88 GB SATA [13:11:41] Lucas_WMDE: i'm gonna try special:purge connected to experimental-eqiad [13:11:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on pc1011 - https://phabricator.wikimedia.org/T423630#11833480 (10Marostegui) Thank you! 
[13:12:15] WORKS (i think) [13:12:30] Lucas_WMDE: can you confirm if you see the new date now? [13:12:32] (03PS1) 10Marostegui: site.pp: Move clouddb1024 to analytics [puppet] - 10https://gerrit.wikimedia.org/r/1273777 (https://phabricator.wikimedia.org/T409557) [13:12:44] yup [13:12:48] what did you do differently :o [13:12:53] teach me your magic [13:13:22] also TIL Special:Purge [13:13:34] just https://en.wikinews.org/wiki/special:purge/Tour_de_France:_Alberto_Contador_wins_the_grand_tour connected to k8s-mw-experimental-eqiad using WikimediaDebug i think [13:13:44] I’m trying another purge without WikimediaDebug to see if it restores the breakage [13:13:49] it does [13:14:19] okay, and now I also got it back into the fixed state [13:14:23] with WikimediaDebug and that backend [13:14:28] but only after a few extra reloads after the purge [13:14:28] o_O [13:14:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on pc1011 - https://phabricator.wikimedia.org/T423630#11833487 (10Jclark-ctr) @Marostegui drive has been swapped [13:14:53] anyway, the change is testable, that’s good. 
so we would just need SRE approval for the deploy [13:15:11] ack [13:16:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on pc1011 - https://phabricator.wikimedia.org/T423630#11833496 (10Marostegui) Rebuilding: ` Raw Size: 1.746 TB [0xdf8fe2b0 Sectors] Firmware state: =====> Rebuild <===== ` [13:16:38] at least we now seem to have confirmed that this might be the fix, whether or not a today-deploy is available :) [13:17:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P91056 and previous config saved to /var/cache/conftool/dbconfig/20260417-131730-fceratto.json [13:17:35] (03PS2) 10Dpogorzelski: kserve-resources: fix securityContext propagation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) [13:18:39] (03PS1) 10Ottomata: html-enrich - reduce checkpoint timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273779 (https://phabricator.wikimedia.org/T421216) [13:20:06] Lucas_WMDE: A_smart_kitten We're checking if it's okay to proceed. We'll let you know. [13:20:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T419961)', diff saved to https://phabricator.wikimedia.org/P91057 and previous config saved to /var/cache/conftool/dbconfig/20260417-132034-fceratto.json [13:20:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [13:21:04] (03PS1) 10Dreamy Jazz: maintain-views: Hide blocks with bl_deleted set to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1273781 (https://phabricator.wikimedia.org/T414188) [13:21:10] ack, thanks! 
[13:22:02] !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp2001.codfw.wmnet [13:22:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:24:25] (03CR) 10FNegri: [C:03+1] site.pp: Move clouddb1024 to analytics [puppet] - 10https://gerrit.wikimedia.org/r/1273777 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [13:24:59] (03CR) 10Marostegui: [C:03+2] site.pp: Move clouddb1024 to analytics [puppet] - 10https://gerrit.wikimedia.org/r/1273777 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [13:26:02] !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp2001.codfw.wmnet [13:26:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance [13:26:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T419961)', diff saved to https://phabricator.wikimedia.org/P91058 and previous config saved to /var/cache/conftool/dbconfig/20260417-132628-fceratto.json [13:26:34] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:27:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11833527 (10brouberol) Thank you! 
[13:27:23] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:27:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P91059 and previous config saved to /var/cache/conftool/dbconfig/20260417-132738-fceratto.json [13:27:43] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:27:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [13:28:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91060 and previous config saved to /var/cache/conftool/dbconfig/20260417-132802-fceratto.json [13:28:05] Lucas_WMDE: re. "not sure if it qualifies as an emergency" would it be a big problem if we waited until Monday? [13:29:01] (03PS1) 10Marostegui: cloudb1024: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) [13:29:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91061 and previous config saved to /var/cache/conftool/dbconfig/20260417-132914-fceratto.json [13:29:23] sobanski: I don’t think so, though I don’t fully understand the impact of the task [13:29:35] (03CR) 10Marostegui: "Won't be submitted today, will wait for the day of the announcement." 
[puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [13:29:44] it sounds like they worked around the issue for the main page, but the rest of the wiki is still affected [13:30:03] and they don’t have a ton of time left before the wiki closes, so I thought it would be nice if we could unbreak it before the weekend [13:30:05] (03CR) 10Marostegui: "@fnegri@wikimedia.org when could I stop 1015 to clone this one?" [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [13:32:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:33:04] (03PS1) 10Jforrester: ImageListPager: Make sure file and filerevision are in correct order [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1273787 (https://phabricator.wikimedia.org/T423654) [13:33:27] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T419961)', diff saved to https://phabricator.wikimedia.org/P91062 and previous config saved to /var/cache/conftool/dbconfig/20260417-133359-fceratto.json [13:35:16] (03CR) 10Ottomata: [C:03+2] html-enrich - reduce checkpoint timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273779 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [13:37:16] (03Merged) 10jenkins-bot: html-enrich - reduce checkpoint timeout [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1273779 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [13:37:47] (03Abandoned) 10Jcrespo: firewall: Update firewall definitions for mediabackups to Puppet 7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1211145 (https://phabricator.wikimedia.org/T349619) (owner: 10Jcrespo) [13:39:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P91063 and previous config saved to /var/cache/conftool/dbconfig/20260417-133923-fceratto.json [13:39:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11833549 (10MoritzMuehlenhoff) [13:40:06] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833550 (10ayounsi) We're not doing Netbox CSV dumps anymore. So you can r... 
[13:42:05] (03PS1) 10Muehlenhoff: Depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1273788 (https://phabricator.wikimedia.org/T423282) [13:42:29] !log bking@apt1002 sudo -E reprepro -C component/opensearch2 include trixie-wikimedia /home/bking/wmf-opensearch-search-plugins-2.19.5+5-trixie/wmf-opensearch-search-plugins_2.19.5+5_amd64.changes [13:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:48] FIRING: PuppetFailure: Puppet has failed on cirrussearch1113:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:42:54] PROBLEM - MariaDB Replica Lag: m1 on db2232 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:04] PROBLEM - MariaDB Replica Lag: m1 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:09] PROBLEM - MariaDB Replica Lag: m1 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:26] (03CR) 10Muehlenhoff: [C:03+2] Depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1273788 (https://phabricator.wikimedia.org/T423282) (owner: 10Muehlenhoff) [13:43:33] !log jmm@dns1004 START - running authdns-update [13:44:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P91064 and previous config saved to /var/cache/conftool/dbconfig/20260417-134408-fceratto.json [13:44:32] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11833559 (10MoritzMuehlenhoff) 
05Resolved→03Open This started again and I've just depooled 1002 again. [13:44:59] !log jmm@dns1004 END - running authdns-update [13:45:11] (03PS3) 10Dpogorzelski: kserve-resources: fix securityContext propagation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) [13:45:55] (03PS1) 10Bking: Revert "opensearch: strip bundled plugins before WMF pkg" [puppet] - 10https://gerrit.wikimedia.org/r/1273789 [13:46:05] (03CR) 10Bking: [V:03+2 C:03+2] Revert "opensearch: strip bundled plugins before WMF pkg" [puppet] - 10https://gerrit.wikimedia.org/r/1273789 (owner: 10Bking) [13:46:54] RECOVERY - MariaDB Replica Lag: m1 on db2232 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:47:15] (03CR) 10CI reject: [V:04-1] kserve-resources: fix securityContext propagation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) (owner: 10Dpogorzelski) [13:47:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:48:23] ^^ waiting for clients resolving new address [13:49:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P91065 and previous config saved to /var/cache/conftool/dbconfig/20260417-134930-fceratto.json [13:50:08] RECOVERY - MariaDB Replica Lag: m1 on db1217 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:50:31] i've probably got to go at some point shortly (~5/10mins?) 
- if a deploy is given the go-ahead while i'm gone, Lucas_WMDE: you are free to deploy my patch while i'm gone if you feel comfortable doing so, if not then no worries & i'll schedule it for monday [13:50:35] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] kserve-resources: fix securityContext propagation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) (owner: 10Dpogorzelski) [13:50:47] ack [13:50:59] thanks! [13:51:03] np :) [13:51:53] (03PS3) 10Elukey: istio: revisit Prometheus buckets for Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) [13:52:29] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:52:33] (03CR) 10Elukey: istio: revisit Prometheus buckets for Wikikube (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [13:52:34] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:52:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:54:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P91066 and previous config saved to /var/cache/conftool/dbconfig/20260417-135416-fceratto.json [13:54:33] !log restarting varnish on cp3066 to clear alerts [13:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:47] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 1 hosts matching query P{cp3066.*} [13:54:49] (03PS1) 10Muehlenhoff: Remove 
puppetmaster::gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/1273790 (https://phabricator.wikimedia.org/T365798) [13:55:51] (03CR) 10Dpogorzelski: [C:03+2] knative-serving: update chart to 1.21.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 (owner: 10Dpogorzelski) [13:55:55] (03PS1) 10Jcrespo: netbox: Remove backups from netbox server, only leave postgres ones [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) [13:56:23] (03CR) 10Andrew Bogott: [C:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [13:56:55] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 3 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833584 (10jcrespo) That was the only thing being backed up. ` bacula:... 
[13:57:05] (03PS1) 10Muehlenhoff: Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1273792 (https://phabricator.wikimedia.org/T365798) [13:57:19] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 1 hosts matching query P{cp3066.*} [13:57:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:57:55] (03PS4) 10Dpogorzelski: knative-serving: update chart to 1.21.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 [13:58:19] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] knative-serving: update chart to 1.21.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 (owner: 10Dpogorzelski) [13:58:31] (03PS4) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [13:58:31] (03CR) 10Ayounsi: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) (owner: 10Jcrespo) [13:58:46] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 1 hosts matching query P{cp3069.*} [13:59:06] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [13:59:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91067 and previous config saved to /var/cache/conftool/dbconfig/20260417-135938-fceratto.json [13:59:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:59:54] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:59:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [14:00:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P91068 and previous config saved to /var/cache/conftool/dbconfig/20260417-140003-fceratto.json [14:00:20] !log restart varnish on cp3069, cp3070, cp3072, cp3073 to clear alerts [14:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:28] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 3 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833590 (10ayounsi) Yeah, Postgres is where all the data are. So +1 to not... 
[14:00:31] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 1 hosts matching query P{cp3069.*} [14:01:11] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 1 hosts matching query P{cp3070.*} [14:01:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P91069 and previous config saved to /var/cache/conftool/dbconfig/20260417-140115-fceratto.json [14:01:51] (03CR) 10Jcrespo: "Will better deploy next week, even if it is a trivial patch (I may ask you to be around in case something goes wrong)." [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) (owner: 10Jcrespo) [14:01:55] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:02:42] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 1 hosts matching query P{cp3070.*} [14:02:43] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 3 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833597 (10jcrespo) p:05Triage→03Medium a:03jcrespo [14:02:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:03:07] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 1 hosts matching query P{cp3072.*} [14:03:13] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:04:20] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:04:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T419961)', diff saved to https://phabricator.wikimedia.org/P91070 and previous config saved to /var/cache/conftool/dbconfig/20260417-140424-fceratto.json [14:04:40] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 1 hosts matching query P{cp3072.*} [14:04:46] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance [14:04:48] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11833600 (10jcrespo) On a trixie system I had an issue in which the host decided to kill big processes rather than becoming slow (something related to stall detection, not due to VM... [14:04:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T419961)', diff saved to https://phabricator.wikimedia.org/P91071 and previous config saved to /var/cache/conftool/dbconfig/20260417-140454-fceratto.json [14:05:01] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 1 hosts matching query P{cp3073.*} [14:05:06] RECOVERY - MariaDB Replica Lag: m1 on db2160 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:06:33] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: Bootstrapping — T412830 [14:06:37] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [14:06:47] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 1 hosts matching query P{cp3073.*} [14:09:25] !log decommissioning Cassandra, aqs1011 [a,b] — T412830 [14:09:28] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:11:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P91072 and previous config saved to /var/cache/conftool/dbconfig/20260417-141123-fceratto.json [14:11:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:12:22] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:12:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T419961)', diff saved to https://phabricator.wikimedia.org/P91073 and previous config saved to /var/cache/conftool/dbconfig/20260417-141222-fceratto.json [14:12:28] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:12:37] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:12:48] RESOLVED: PuppetFailure: Puppet has failed on cirrussearch1113:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:13:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:14:13] (03PS2) 10Jcrespo: netbox: Remove backups from netbox server, only leave postgres ones [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) [14:14:30] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:16:16] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:16:37] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:17:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:18:26] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:18:33] (03PS3) 10Jcrespo: netbox: Remove backups from netbox server, only leave postgres ones [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) [14:18:37] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) (owner: 10Jcrespo) [14:20:42] (03PS4) 10Jcrespo: netbox: Remove backups from netbox server, only leave postgres ones [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) [14:20:51] (03PS5) 10Jcrespo: netbox: Remove backups from netbox server, only leave postgres ones [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) [14:20:53] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) (owner: 10Jcrespo) [14:21:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P91074 and previous config saved to /var/cache/conftool/dbconfig/20260417-142130-fceratto.json [14:22:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P91075 and previous config saved to /var/cache/conftool/dbconfig/20260417-142230-fceratto.json [14:22:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:24:21] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11833663 (10Jclark-ctr) a:03Jclark-ctr [14:25:37] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11833677 (10Jclark-ctr) [14:31:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P91076 and previous config saved to /var/cache/conftool/dbconfig/20260417-143139-fceratto.json [14:31:44] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:31:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [14:32:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P91077 and previous config saved to /var/cache/conftool/dbconfig/20260417-143204-fceratto.json [14:32:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P91078 and previous config saved to /var/cache/conftool/dbconfig/20260417-143238-fceratto.json [14:32:55] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:32:59] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:33:05] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:33:09] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:33:39] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:33:55] (03PS2) 10Bartosz Dziewoński: Remove temporary `wgOAuth2UsePrefixedSub` feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) (owner: 10D3r1ck01) [14:34:01] (03CR) 10Bartosz Dziewoński: [C:03+1] Remove temporary `wgOAuth2UsePrefixedSub` feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) (owner: 10D3r1ck01) [14:34:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P91079 and previous config saved to /var/cache/conftool/dbconfig/20260417-143416-fceratto.json [14:36:00] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:36:30] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:37:19] (03CR) 10FNegri: "Is next Tuesday ok? Any day next week is fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [14:37:40] (03PS5) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [14:37:40] (03PS1) 10Andrew Bogott: Cloud-vps: use Flamingo config files for Horizon [puppet] - 10https://gerrit.wikimedia.org/r/1273831 [14:37:40] (03PS1) 10Andrew Bogott: Openstack: change spec tests to expect verison Flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1273832 [14:37:41] (03PS1) 10Andrew Bogott: cloud-vps: switch VM openstack references to version Flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1273833 [14:37:42] (03PS1) 10Andrew Bogott: Openstack: remove packages for version Dalmatian [puppet] - 10https://gerrit.wikimedia.org/r/1273834 [14:37:44] (03PS1) 10Andrew Bogott: Openstack: remove packages for version Epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1273835 [14:37:55] (03CR) 10Marostegui: "Tuesday works for me yep. s6 is small so it shouldn't take long (famous last words?)" [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [14:39:06] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [14:39:16] (03PS15) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) [14:39:47] (03CR) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. 
(033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [14:42:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T419961)', diff saved to https://phabricator.wikimedia.org/P91080 and previous config saved to /var/cache/conftool/dbconfig/20260417-144247-fceratto.json [14:43:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273831 (owner: 10Andrew Bogott) [14:43:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [14:44:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P91081 and previous config saved to /var/cache/conftool/dbconfig/20260417-144424-fceratto.json [14:48:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance [14:48:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T419961)', diff saved to https://phabricator.wikimedia.org/P91082 and previous config saved to /var/cache/conftool/dbconfig/20260417-144819-fceratto.json [14:48:34] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:48:39] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:51:16] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:51:20] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:52:00] sobanski: any update on that deploy request? [14:52:20] (03CR) 10JHathaway: "@jcrespo@wikimedia.org does it cause problems to leave them in test files?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1273699 (https://phabricator.wikimedia.org/T422851) (owner: 10Jcrespo) [14:53:04] PROBLEM - MariaDB Replica Lag: m1 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:53:08] PROBLEM - MariaDB Replica Lag: m1 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:53:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11833798 (10Aklapper) @DamianZaremba: Could you please answer the last comment? Thanks in advance! [14:53:54] PROBLEM - MariaDB Replica Lag: m1 on db2232 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:54:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P91083 and previous config saved to /var/cache/conftool/dbconfig/20260417-145432-fceratto.json [14:54:54] RECOVERY - MariaDB Replica Lag: m1 on db2232 is OK: OK slave_sql_lag Replication lag: 25.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:55:13] (03CR) 10Andrew Bogott: [C:03+2] Cloud-vps: use Flamingo config files for Horizon [puppet] - 10https://gerrit.wikimedia.org/r/1273831 (owner: 10Andrew Bogott) [14:55:20] (03CR) 10Andrew Bogott: [C:03+2] Openstack: change spec tests to expect verison Flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1273832 (owner: 10Andrew Bogott) [14:55:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T419961)', diff saved to https://phabricator.wikimedia.org/P91084 and previous config saved to 
/var/cache/conftool/dbconfig/20260417-145524-fceratto.json [14:56:41] (03PS2) 10Andrew Bogott: cloud-vps: switch VM openstack references to version Flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1273833 [14:56:41] (03PS2) 10Andrew Bogott: Openstack: remove packages for version Dalmatian [puppet] - 10https://gerrit.wikimedia.org/r/1273834 [14:56:41] (03PS2) 10Andrew Bogott: Openstack: remove packages for version Epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1273835 [14:56:42] (03PS6) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [14:58:19] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [15:00:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1059:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1059 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:01:20] (03PS7) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [15:02:08] RECOVERY - MariaDB Replica Lag: m1 on db1217 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:04:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P91085 and previous config saved to /var/cache/conftool/dbconfig/20260417-150440-fceratto.json [15:04:47] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:04:57] !log 
fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [15:05:26] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [15:05:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P91086 and previous config saved to /var/cache/conftool/dbconfig/20260417-150532-fceratto.json [15:05:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1059:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1059 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:08:59] FTR, I’ve received a go-ahead for the emergency deploy mentioned earlier, so I’ll do that soon unless someone shouts [15:10:28] (03PS1) 10Aude: Enable ReadingLists beta feature for all Wikipedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273842 (https://phabricator.wikimedia.org/T420881) [15:11:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273842 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [15:14:21] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11833878 (10LSobanski) lists2001 is the standby host so it should be safe to reboot. cc @ABran-WMF for confirmation. 
[15:14:38] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11833879 (10LSobanski) p:05Triage→03Low [15:15:40] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 118.78 ms [15:15:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P91087 and previous config saved to /var/cache/conftool/dbconfig/20260417-151541-fceratto.json [15:16:04] RECOVERY - MariaDB Replica Lag: m1 on db2160 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:16:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance [15:17:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [15:17:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P91088 and previous config saved to /var/cache/conftool/dbconfig/20260417-151723-fceratto.json [15:17:28] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:17:38] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:17:43] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:18:25] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:18:30] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:18:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273763 (https://phabricator.wikimedia.org/T423512) (owner: 10A smart kitten) [15:19:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P91089 and previous config saved to /var/cache/conftool/dbconfig/20260417-151936-fceratto.json [15:19:50] (03Merged) 10jenkins-bot: enwikinews: Move override for $wgFlaggedRevsHandleIncludes to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273763 (https://phabricator.wikimedia.org/T423512) (owner: 10A smart kitten) [15:20:42] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1273763|enwikinews: Move override for $wgFlaggedRevsHandleIncludes to InitialiseSettings.php (T423512)]] [15:20:46] T423512: Some pages on en.WN not transcluding specific templates properly - https://phabricator.wikimedia.org/T423512 [15:21:56] * Lucas_WMDE deploying ^ [15:22:07] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:22:12] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:22:26] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, asmartkitten: Backport for [[gerrit:1273763|enwikinews: Move override for $wgFlaggedRevsHandleIncludes to InitialiseSettings.php (T423512)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:22:45] testing [15:23:03] needed a Ctrl+F5 again (idk why) but yes it seems to work \o/ [15:23:18] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:23:22] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:23:33] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, asmartkitten: Continuing with sync [15:24:19] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:24:24] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:25:44] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:25:48] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:25:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T419961)', diff saved to https://phabricator.wikimedia.org/P91090 and previous config saved to /var/cache/conftool/dbconfig/20260417-152549-fceratto.json [15:25:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [15:25:59] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ... 
[15:25:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:26:12] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1259.eqiad.wmnet with reason: Maintenance [15:26:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T419961)', diff saved to https://phabricator.wikimedia.org/P91091 and previous config saved to /var/cache/conftool/dbconfig/20260417-152620-fceratto.json [15:27:33] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1273763|enwikinews: Move override for $wgFlaggedRevsHandleIncludes to InitialiseSettings.php (T423512)]] (duration: 06m 51s) [15:27:38] T423512: Some pages on en.WN not transcluding specific templates properly - https://phabricator.wikimedia.org/T423512 [15:27:57] 06SRE-OnFire, 06Release-Engineering-Team, 10Scap, 06serviceops-deprecated, 07Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#11833922 (10dancy) >>! In T390531#11831389, @dancy wrote... 
[15:29:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P91092 and previous config saved to /var/cache/conftool/dbconfig/20260417-152944-fceratto.json [15:31:45] * Lucas_WMDE done deploying [15:33:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T419961)', diff saved to https://phabricator.wikimedia.org/P91093 and previous config saved to /var/cache/conftool/dbconfig/20260417-153354-fceratto.json [15:39:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P91094 and previous config saved to /var/cache/conftool/dbconfig/20260417-153953-fceratto.json [15:41:09] (03CR) 10FNegri: [C:03+2] mariadb: wiki-replicas: add missing grants [puppet] - 10https://gerrit.wikimedia.org/r/1270891 (https://phabricator.wikimedia.org/T422806) (owner: 10FNegri) [15:41:38] (03CR) 10FNegri: [C:03+2] mariadb: wiki-replicas: add grants for %_maintain [puppet] - 10https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806) (owner: 10FNegri) [15:41:42] (03CR) 10FNegri: [C:03+2] mariadb: wiki-replicas: remove redundant grants [puppet] - 10https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806) (owner: 10FNegri) [15:43:37] (03PS8) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [15:44:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P91095 and previous config saved to /var/cache/conftool/dbconfig/20260417-154402-fceratto.json [15:44:57] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) 
(owner: 10Andrew Bogott) [15:48:10] (03PS1) 10Effie Mouzeli: mw-parsoid: bump envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273855 (https://phabricator.wikimedia.org/T420336) [15:49:27] (03CR) 10Jgiannelos: [C:03+1] mw-parsoid: bump envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273855 (https://phabricator.wikimedia.org/T420336) (owner: 10Effie Mouzeli) [15:49:56] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:50:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P91096 and previous config saved to /var/cache/conftool/dbconfig/20260417-155001-fceratto.json [15:50:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:50:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [15:50:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P91097 and previous config saved to /var/cache/conftool/dbconfig/20260417-155015-fceratto.json [15:51:26] (03PS2) 10Effie Mouzeli: mw-parsoid: bump envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273855 (https://phabricator.wikimedia.org/T420336) [15:52:26] (03CR) 10Marostegui: "@fnegri@wikimedia.org this looks good right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [15:52:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P91098 and previous config saved to /var/cache/conftool/dbconfig/20260417-155228-fceratto.json [15:52:41] (03CR) 10Jgiannelos: [C:03+1] mw-parsoid: bump envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273855 (https://phabricator.wikimedia.org/T420336) (owner: 10Effie Mouzeli) [15:53:25] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:53:47] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:54:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:54:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P91099 and previous config saved to /var/cache/conftool/dbconfig/20260417-155410-fceratto.json [15:56:22] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid: bump envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273855 (https://phabricator.wikimedia.org/T420336) (owner: 10Effie Mouzeli) [15:57:48] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11834010 (10elukey) @VRiley-WMF sorry for the lag, completely lost this task! I tried today to run the provision script but I get stuck in the first check_connection: ` 2026-0... 
[15:58:30] (03Merged) 10jenkins-bot: mw-parsoid: bump envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273855 (https://phabricator.wikimedia.org/T420336) (owner: 10Effie Mouzeli) [15:58:53] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:59:04] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:59:15] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:59:44] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:01:50] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:02:21] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:02:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P91100 and previous config saved to /var/cache/conftool/dbconfig/20260417-160236-fceratto.json [16:04:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T419961)', diff saved to https://phabricator.wikimedia.org/P91101 and previous config saved to /var/cache/conftool/dbconfig/20260417-160418-fceratto.json [16:08:45] (03CR) 10FNegri: [C:03+1] "Yep sorry, forgot to +1!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [16:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:12:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P91102 and previous config saved to /var/cache/conftool/dbconfig/20260417-161245-fceratto.json [16:15:30] (03PS9) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [16:15:30] (03PS1) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [16:16:34] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [16:17:27] (03CR) 10CI reject: [V:04-1] cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [16:19:23] (03PS2) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [16:22:47] (03PS10) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [16:22:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P91103 and previous config saved to 
/var/cache/conftool/dbconfig/20260417-162253-fceratto.json [16:22:58] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:23:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2221.codfw.wmnet with reason: Maintenance [16:23:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P91104 and previous config saved to /var/cache/conftool/dbconfig/20260417-162307-fceratto.json [16:23:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [16:23:48] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [16:25:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719 (10JMeybohm) 03NEW [16:25:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P91105 and previous config saved to /var/cache/conftool/dbconfig/20260417-162520-fceratto.json [16:27:57] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11834146 (10JMeybohm) [16:33:09] (03PS3) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [16:33:09] (03PS11) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [16:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:38] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [16:35:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P91107 and previous config saved to /var/cache/conftool/dbconfig/20260417-163528-fceratto.json [16:35:55] (03PS4) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [16:35:55] (03PS12) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [16:36:46] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723 (10herron) 03NEW [16:37:01] (03PS5) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [16:37:01] (03PS13) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [16:37:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [16:37:57] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11834206 (10herron) p:05Triage→03High [16:38:26] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: 
Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11834210 (10herron) [16:38:31] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [16:38:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Hurricane Electric (2001:504:30::ba00:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:39:02] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724 (10Papaul) 03NEW [16:39:29] (03CR) 10CI reject: [V:04-1] cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [16:39:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11834227 (10wiki_willy) Thanks @Clement_Goubert! @VRiley-WMF & @Jclark-ctr - this is the... [16:42:10] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11834242 (10herron) @brouberol, do the steps in the description look ok to you, anything I missed? 
[16:42:40] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11834243 (10herron) a:03herron [16:43:05] (03PS6) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [16:43:05] (03PS14) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [16:43:23] FIRING: [4x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:37] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [16:44:29] (03CR) 10CI reject: [V:04-1] Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [16:45:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P91108 and previous config saved to /var/cache/conftool/dbconfig/20260417-164536-fceratto.json [16:52:13] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11834286 (10CDanis) FYI: as of my Puppet patches above, you can now use an x... 
[16:55:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P91110 and previous config saved to /var/cache/conftool/dbconfig/20260417-165544-fceratto.json [16:55:49] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:55:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: Maintenance [16:56:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P91111 and previous config saved to /var/cache/conftool/dbconfig/20260417-165559-fceratto.json [16:58:03] (03PS7) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [16:58:03] (03PS15) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [16:58:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [16:58:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P91112 and previous config saved to /var/cache/conftool/dbconfig/20260417-165811-fceratto.json [16:58:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Hurricane Electric (2001:504:30::ba00:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:58:51] (03CR) 10CI reject: [V:04-1] cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [17:00:28] 10ops-codfw, 06SRE, 06collaboration-services, 
06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11834314 (10Dzahn) Once the setup issues are resolved we will implement Phabricator and replace phab2002 with it over at T423727. [17:03:52] (03PS1) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [17:05:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11834353 (10Dzahn) 05Resolved→03Open a:05Scott_French→03None [17:05:02] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11834355 (10CDanis) BTW -- here are two canned queries for distributed trace... [17:05:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:07:30] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:08:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:08:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P91113 and previous config saved to /var/cache/conftool/dbconfig/20260417-170819-fceratto.json [17:09:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:10:11] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:10:19] (03PS2) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [17:10:22] (03CR) 10CDanis: [V:03+1 C:03+1] "Amir or someone, please roll this out in my absence on Monday :D" [puppet] - 10https://gerrit.wikimedia.org/r/1272869 (owner: 10CDanis) [17:10:40] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:10:47] 06SRE-OnFire: List cookbooks on wikitech - https://phabricator.wikimedia.org/T423730 (10jijiki) 03NEW [17:10:49] (03PS1) 10Dzahn: etherpad: update warning message about truncation to end of May [puppet] - 10https://gerrit.wikimedia.org/r/1273889 (https://phabricator.wikimedia.org/T415237) [17:11:21] (03CR) 10Pppery: etherpad: update warning message about truncation to end of May (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273889 (https://phabricator.wikimedia.org/T415237) (owner: 10Dzahn) [17:11:34] (03CR) 10Dzahn: [C:03+2] etherpad: update warning message about truncation to end of May [puppet] - 10https://gerrit.wikimedia.org/r/1273889 (https://phabricator.wikimedia.org/T415237) (owner: 10Dzahn) [17:11:54] (03PS3) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 
(https://phabricator.wikimedia.org/T422860) [17:12:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:12:12] (03PS2) 10Dzahn: etherpad: update warning message about truncation to end of May [puppet] - 10https://gerrit.wikimedia.org/r/1273889 (https://phabricator.wikimedia.org/T415237) [17:12:18] (03CR) 10Dzahn: etherpad: update warning message about truncation to end of May (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273889 (https://phabricator.wikimedia.org/T415237) (owner: 10Dzahn) [17:13:00] (03CR) 10Dzahn: [C:03+2] etherpad: update warning message about truncation to end of May [puppet] - 10https://gerrit.wikimedia.org/r/1273889 (https://phabricator.wikimedia.org/T415237) (owner: 10Dzahn) [17:14:19] (03PS8) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [17:14:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [17:14:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:16:21] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11834387 (10BTullis) I have drafted some SLI definitions here: https://wikitech.wikimed... 
[17:18:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P91114 and previous config saved to /var/cache/conftool/dbconfig/20260417-171827-fceratto.json [17:24:11] (03PS4) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [17:25:05] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11834420 (10BTullis) It's worth noting that when I started working on the latency indic... [17:25:06] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:28:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P91116 and previous config saved to /var/cache/conftool/dbconfig/20260417-172835-fceratto.json [17:28:40] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:28:49] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11834430 (10BTullis) There are also some draft SLOs here: https://wikitech.wikimedia.or... 
[17:32:26] (03PS9) 10Andrew Bogott: cloud-vps cloudcontrols: add profile::openstack::base::serverpackages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [17:32:26] (03PS16) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [17:34:02] (03PS5) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [17:34:27] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11834455 (10brouberol) cc @elukey as he was the one working on the puppet side of things. It //does// look good to me though! [17:34:28] (03PS10) 10Andrew Bogott: cloud-vps: consolidate inclusion of openstack server/client packages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [17:34:28] (03PS17) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [17:34:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:35:51] (03CR) 10CI reject: [V:04-1] cloud-vps: consolidate inclusion of openstack server/client packages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [17:36:02] (03PS1) 10Dzahn: etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) [17:36:35] (03CR) 10CI reject: [V:04-1] etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:36:49] (03CR) 10Dzahn: "using some parts of 
the message we also use in new pad messages" [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:37:13] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [17:37:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [17:37:36] (03PS2) 10Dzahn: etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) [17:40:40] (03CR) 10Pppery: "I would stick a "Bug: T371591" link on this commit, as this finally does what that task actually requested." [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:42:35] (03PS3) 10Dzahn: etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) [17:42:50] (03CR) 10Dzahn: "ah, thanks! doing!" [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:43:06] (03CR) 10Pppery: "Also maybe stick a link to https://meta.wikimedia.org/wiki/Etherpad somewhere in the content somewhere. Otherwise looks good to me as far " [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:43:23] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1273902/8439/etherpad1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:46:28] (03CR) 10Dzahn: [V:03+1] "Ok. done! linked "WMF Etherpad" to the URL above." 
[puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:46:53] (03PS4) 10Dzahn: etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) [17:47:29] (03CR) 10Pppery: [C:03+1] etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:47:42] (03CR) 10Pppery: [C:03+1] "This got lost in PS4" [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:48:04] (03CR) 10Dzahn: [C:03+2] etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:49:27] (03CR) 10Jdlrobson: [C:03+1] Enable ReadingLists beta feature for all Wikipedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273842 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [17:49:39] (03PS5) 10Dzahn: etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) [17:49:52] (03PS6) 10Dzahn: etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) [17:52:42] (03PS6) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [17:52:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:53:13] (03CR) 10CI reject: [V:04-1] OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 
(https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:55:42] (03CR) 10Dzahn: [C:03+2] etherpad: insert a warning message into the index HTML [puppet] - 10https://gerrit.wikimedia.org/r/1273902 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [17:59:03] (03PS1) 10CDanis: mwscript-k8s: add --output-file flag [puppet] - 10https://gerrit.wikimedia.org/r/1273905 [18:45:55] (03CR) 10Dzahn: [C:03+1] gerrit: migrate data gerrit1003 to /srv/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1273449 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [18:47:57] (03CR) 10Dzahn: gerrit: update sync-instances cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [18:51:42] (03CR) 10Dzahn: [C:03+1] "+1 as long as things happen in the right order" [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [18:52:34] (03PS7) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [18:53:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:54:35] (03CR) 10CI reject: [V:04-1] OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:55:31] (03PS8) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [18:57:25] (03PS11) 10Andrew Bogott: cloud-vps: consolidate inclusion of openstack server/client packages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 [18:57:25] (03PS18) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo 
[puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [18:57:39] (03CR) 10CI reject: [V:04-1] OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:58:36] (03PS9) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [18:59:28] (03CR) 10CI reject: [V:04-1] OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:02:11] (03PS10) 10Bking: OpenSearch: Control which plugins we use via systemd PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [19:02:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:04:06] (03CR) 10Dzahn: [C:03+2] jenkins: allow disabling jenkins even on the manager host [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:05:11] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:09:27] (03CR) 10Dzahn: [C:03+2] "complete noop on all contint* servers - but allows disabling jenkins on legacy hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:11:48] (03PS1) 10Dzahn: contint: disable jenkins on legacy CI hosts [puppet] - 10https://gerrit.wikimedia.org/r/1273919 (https://phabricator.wikimedia.org/T418109) [19:12:28] (03PS11) 10Bking: OpenSearch: Control which plugins we use via systemd 
PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) [19:12:41] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [19:13:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:14:10] (03PS1) 10CDobbins: data.yaml: remove legacy ssh key; add fido backup [puppet] - 10https://gerrit.wikimedia.org/r/1273925 [19:15:38] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1273919/8441/" [puppet] - 10https://gerrit.wikimedia.org/r/1273919 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:16:02] (03PS7) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [19:16:02] (03PS1) 10CDanis: kubeadm: add kubectl wait-job plugin [puppet] - 10https://gerrit.wikimedia.org/r/1273926 [19:16:45] (03CR) 10CDobbins: [C:03+2] data.yaml: remove legacy ssh key; add fido backup [puppet] - 10https://gerrit.wikimedia.org/r/1273925 (owner: 10CDobbins) [19:16:55] (03CR) 10CDanis: "This is admittedly untested (too many prereq patches in this stack), but should work :)" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [19:17:17] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding phab1006 to eqiad - jclark@cumin1003" [19:17:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding phab1006 to eqiad - jclark@cumin1003" [19:17:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:36] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [19:18:25] (03CR) 10CI reject: [V:04-1] fundraising_data_import maintenance 
script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [19:18:43] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:19:02] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:19:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:19:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:19:58] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:20:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:20:11] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:20:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:20:40] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:21:11] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1270580/8442/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1270580 (https://phabricator.wikimedia.org/T422895) (owner: 10Dzahn) [19:21:26] !log jclark@cumin1003 START - Cookbook 
sre.hosts.provision for host mc1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:21:27] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding phab1006 to eqiad - jclark@cumin1003" [19:21:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding phab1006 to eqiad - jclark@cumin1003" [19:21:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:21:40] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:21:48] (03PS8) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [19:22:01] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:22:07] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1068.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:22:18] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1067.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:22:34] (03CR) 10Bking: [C:03+2] "self-merging, as PCC suggests only the test server will be affected." [puppet] - 10https://gerrit.wikimedia.org/r/1273887 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:23:28] (03CR) 10Dduvall: "So smart! 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [19:24:41] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:25:11] FIRING: [5x] BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:25:30] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1066.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:25:40] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:25:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [19:25:59] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ... 
[19:25:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [19:28:04] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: make gerrit ssh key configurable in Hiera and add it [puppet] - 10https://gerrit.wikimedia.org/r/1270580 (https://phabricator.wikimedia.org/T422895) (owner: 10Dzahn) [19:32:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:32:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:32:59] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:33:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:03] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:33:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1068.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1067.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:22] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1064.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:24] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1063.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:28] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1065.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:37] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1062.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:35:04] !log jclark@cumin1003 START - Cookbook 
sre.hosts.provision for host mc1061.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:35:11] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1059.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:35:30] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1060.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:35:32] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:36:12] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:36:26] (03CR) 10Dzahn: [V:03+1 C:03+2] "The error message we wanted to see. No more other problems and the key is there - it just has not been added on the Gerrit side: "paramiko" [puppet] - 10https://gerrit.wikimedia.org/r/1270580 (https://phabricator.wikimedia.org/T422895) (owner: 10Dzahn) [19:36:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1066.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:36:43] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1057.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:37:02] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:37:26] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:37:47] (03PS1) 10Bking: opensearch: move var up so we can use it earlier [puppet] - 10https://gerrit.wikimedia.org/r/1273937 (https://phabricator.wikimedia.org/T422860) [19:38:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273937 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:38:43] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host 
mc1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:40:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1063.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:41:15] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:42:03] (03CR) 10Bking: [C:03+2] opensearch: move var up so we can use it earlier [puppet] - 10https://gerrit.wikimedia.org/r/1273937 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:44:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1064.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:44:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1065.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:44:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1062.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:46:08] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1061.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:47:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1059.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:48:11] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:48:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1057.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:49:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1060.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:49:31] (03PS2) 10CDanis: deployment_server: add kubectl wait-job plugin [puppet] - 10https://gerrit.wikimedia.org/r/1273926 [19:49:31] (03PS9) 10CDanis: 
fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [19:50:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:50:40] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1060.eqiad.wmnet with OS bookworm [19:50:45] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834773 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1060.eqiad.wmnet with OS bookworm [19:53:31] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:55:17] (03PS1) 10Bking: cloudelastic1012: remove the deliberately-introduced typo [puppet] - 10https://gerrit.wikimedia.org/r/1273943 (https://phabricator.wikimedia.org/T423327) [19:57:20] (03CR) 10Bking: [C:03+2] cloudelastic1012: remove the deliberately-introduced typo [puppet] - 10https://gerrit.wikimedia.org/r/1273943 (https://phabricator.wikimedia.org/T423327) (owner: 10Bking) [19:59:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:03:24] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1055.eqiad.wmnet with OS bookworm [20:03:35] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1055.eqiad.wmnet with OS bookworm [20:03:47] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1056.eqiad.wmnet with OS bookworm [20:03:52] 10ops-eqiad, 
06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834790 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1056.eqiad.wmnet with OS bookworm [20:04:08] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1057.eqiad.wmnet with OS bookworm [20:04:13] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1057.eqiad.wmnet with OS bookworm [20:06:12] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [20:08:30] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:09:11] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1059.eqiad.wmnet with OS bookworm [20:09:15] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1064.eqiad.wmnet with OS bookworm [20:09:17] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1059.eqiad.wmnet with OS bookworm [20:09:20] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834793 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1064.eqiad.wmnet with OS bookworm [20:10:21] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1061.eqiad.wmnet with OS bookworm [20:10:29] !log jclark@cumin1003 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [20:10:30] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1062.eqiad.wmnet with OS bookworm [20:10:32] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834794 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1061.eqiad.wmnet with OS bookworm [20:10:35] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834795 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1062.eqiad.wmnet with OS bookworm [20:13:21] !log planet1003, planet2003 - rebooting on ganeti level for T422596 [20:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:25] T422596: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596 [20:14:16] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage [20:14:57] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1057.eqiad.wmnet with reason: host reimage [20:15:08] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:17:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:17:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:20:09] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [20:20:12] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11834804 (10Dzahn) I did: ` [ganeti1046:~] $ sudo gnt-instance modify -B memory=2G planet1003.eqiad.wmnet .. 
[ganeti1046:~] $ sudo gnt-instance reboot planet1003.eqiad.wmnet ..... [20:20:13] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [20:20:19] (03PS10) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [20:20:19] (03PS1) 10Kamila Součková: Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) [20:20:30] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11834806 (10Dzahn) [20:20:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage [20:20:54] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1061.eqiad.wmnet with reason: host reimage [20:21:14] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [20:21:51] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2250 to codfw - jhancock@cumin2002" [20:21:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2250 to codfw - jhancock@cumin2002" [20:21:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:22:07] (03CR) 10CI reject: [V:04-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [20:22:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2250 [20:22:17] (03CR) 
10CI reject: [V:04-1] Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [20:22:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2250 [20:22:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2251 [20:22:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2251 [20:22:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2252 [20:23:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2252 [20:23:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2253 [20:23:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2253 [20:24:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [20:25:13] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:25:26] (03PS2) 10Dzahn: ci: switch jenkins proxy target to new discovery name [puppet] - 10https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521) [20:28:01] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:28:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:28:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1060.eqiad.wmnet with OS bookworm [20:28:39] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 
10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834825 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1060.eqiad.wmnet with OS bookworm completed: - mc1... [20:28:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1061.eqiad.wmnet with reason: host reimage [20:28:49] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1063.eqiad.wmnet with OS bookworm [20:28:56] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1063.eqiad.wmnet with OS bookworm [20:29:01] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1058.eqiad.wmnet with OS bookworm [20:29:05] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1058.eqiad.wmnet with OS bookworm [20:29:53] (03CR) 10Kamila Součková: "OF COURSE SOMEBODY MERGES A COMPLETELY UNRELATED CHANGE THAT BREAKS CI THE MOMENT I FINALLY GET THIS MONTHS OLD THING TO NOT BREAK CI XD" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [20:30:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: adding pc2021 to codfw - jhancock@cumin2002" [20:31:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding pc2021 to codfw - jhancock@cumin2002" [20:31:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:31:48] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1065.eqiad.wmnet with OS bookworm [20:31:53] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834837 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1065.eqiad.wmnet with OS bookworm [20:32:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [20:33:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host pc2021 [20:34:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2021 [20:34:09] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host pc2022 [20:34:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2022 [20:34:26] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host pc2023 [20:34:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2023 [20:34:54] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host pc2024 [20:35:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2024 [20:35:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:36:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1057.eqiad.wmnet with reason: host reimage [20:36:48] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:36:53] RECOVERY - MegaRAID on pc1011 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:37:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:37:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1055.eqiad.wmnet with OS bookworm [20:37:36] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1055.eqiad.wmnet with OS bookworm completed: - mc1... 
[20:37:43] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1066.eqiad.wmnet with OS bookworm [20:37:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:49] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834859 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1066.eqiad.wmnet with OS bookworm [20:37:59] (03PS1) 10Dzahn: etherpad: add 90 day deletion warning to default pad message [puppet] - 10https://gerrit.wikimedia.org/r/1273991 (https://phabricator.wikimedia.org/T421315) [20:38:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:38:55] (03CR) 10Dzahn: [C:03+2] etherpad: add 90 day deletion warning to default pad message [puppet] - 10https://gerrit.wikimedia.org/r/1273991 (https://phabricator.wikimedia.org/T421315) (owner: 10Dzahn) [20:39:08] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [20:40:06] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [20:42:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [20:42:59] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [20:43:00] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:43:22] (03CR) 10Catrope: Set CSP to enforce with currently-allow-listed domains on Beta (031 
comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [20:43:23] FIRING: [4x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:43:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:43:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1064.eqiad.wmnet with OS bookworm [20:43:28] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1064.eqiad.wmnet with OS bookworm completed: - mc1... 
[20:45:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [20:45:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on pc1011 - https://phabricator.wikimedia.org/T423630#11834899 (10Jclark-ctr) 05Open→03Resolved Drive restored error cleared [20:46:45] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:47:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:47:27] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1061.eqiad.wmnet with OS bookworm [20:47:32] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1061.eqiad.wmnet with OS bookworm completed: - mc1... 
[20:47:35] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834905 (10Jclark-ctr) [20:47:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:48:33] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [20:48:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:48:46] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834906 (10Jclark-ctr) [20:48:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:49:36] (03PS1) 10Dzahn: etherpad: stop puppet from adding the same file_line multiple times [puppet] - 10https://gerrit.wikimedia.org/r/1274005 (https://phabricator.wikimedia.org/T420793) [20:50:07] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:50:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [20:50:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:50:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1062.eqiad.wmnet with OS bookworm [20:50:38] 10ops-eqiad, 06DC-Ops, 
06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834911 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1062.eqiad.wmnet with OS bookworm completed: - mc1... [20:51:40] (03PS2) 10Dzahn: etherpad: stop puppet from adding the same file_line multiple times [puppet] - 10https://gerrit.wikimedia.org/r/1274005 (https://phabricator.wikimedia.org/T420793) [20:51:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:53:36] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:54:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:54:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1057.eqiad.wmnet with OS bookworm [20:54:06] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1057.eqiad.wmnet with OS bookworm completed: - mc1... 
[20:54:28] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834918 (10Jclark-ctr) [20:54:30] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834919 (10Jclark-ctr) [20:54:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [20:55:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11834920 (10VRiley-WMF) a:03VRiley-WMF [20:55:10] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1067.eqiad.wmnet with OS bookworm [20:55:14] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1067.eqiad.wmnet with OS bookworm [20:56:19] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1068.eqiad.wmnet with OS bookworm [20:56:26] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1068.eqiad.wmnet with OS bookworm [20:59:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [20:59:37] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:59:49] (03CR) 10Dzahn: [C:03+2] etherpad: stop puppet from adding the same file_line multiple times [puppet] - 
10https://gerrit.wikimedia.org/r/1274005 (https://phabricator.wikimedia.org/T420793) (owner: 10Dzahn) [21:00:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:00:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1059.eqiad.wmnet with OS bookworm [21:00:49] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1059.eqiad.wmnet with OS bookworm completed: - mc1... [21:01:10] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834930 (10Jclark-ctr) [21:02:02] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:02:07] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1069.eqiad.wmnet with OS bookworm [21:02:14] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1069.eqiad.wmnet with OS bookworm [21:02:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:02:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1058.eqiad.wmnet with OS bookworm [21:02:27] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - 
https://phabricator.wikimedia.org/T412255#11834937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1058.eqiad.wmnet with OS bookworm completed: - mc1... [21:02:44] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834938 (10Jclark-ctr) [21:03:07] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1070.eqiad.wmnet with OS bookworm [21:03:14] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1070.eqiad.wmnet with OS bookworm [21:03:18] (03PS1) 10CDanis: More features and fixes for historical API [software/klaxon] - 10https://gerrit.wikimedia.org/r/1274025 [21:03:18] (03PS1) 10CDanis: Add SQLite-backed incident store, migrations, and weekly analytics CLI [software/klaxon] - 10https://gerrit.wikimedia.org/r/1274026 [21:03:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:03:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:04:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:04:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:04:53] (03CR) 10CI reject: [V:04-1] Add SQLite-backed incident store, migrations, and weekly analytics CLI [software/klaxon] - 
10https://gerrit.wikimedia.org/r/1274026 (owner: 10CDanis) [21:05:19] (03PS2) 10CDanis: Add SQLite-backed incident store, migrations, and weekly analytics CLI [software/klaxon] - 10https://gerrit.wikimedia.org/r/1274026 [21:06:35] (03CR) 10CI reject: [V:04-1] Add SQLite-backed incident store, migrations, and weekly analytics CLI [software/klaxon] - 10https://gerrit.wikimedia.org/r/1274026 (owner: 10CDanis) [21:06:55] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [21:07:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:41] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1065.eqiad.wmnet with OS bookworm [21:07:51] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834964 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1065.eqiad.wmnet with OS bookworm executed with er... 
[21:08:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:08:36] (03PS3) 10CDanis: WIP [DNM] Add SQLite-backed incident store and analytics CLI [software/klaxon] - 10https://gerrit.wikimedia.org/r/1274026 [21:09:04] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:09:49] (03CR) 10CI reject: [V:04-1] WIP [DNM] Add SQLite-backed incident store and analytics CLI [software/klaxon] - 10https://gerrit.wikimedia.org/r/1274026 (owner: 10CDanis) [21:09:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:09:58] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1065.eqiad.wmnet with OS bookworm [21:10:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:10:09] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1065.eqiad.wmnet with OS bookworm [21:10:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:10:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [21:12:47] !log jclark@cumin1003 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:12:51] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [21:13:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:13:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1066.eqiad.wmnet with OS bookworm [21:13:16] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1066.eqiad.wmnet with OS bookworm completed: - mc1... [21:13:38] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11834986 (10Jclark-ctr) [21:14:16] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc2021'] [21:14:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['pc2021'] [21:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:15:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2021.codfw.wmnet with OS trixie [21:15:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host 
pc2021.codfw.wmnet with OS trixie [21:15:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2022.codfw.wmnet with OS trixie [21:16:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2022.codfw.wmnet with OS trixie [21:16:03] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:16:10] (03CR) 10Dzahn: [C:04-1] "21 needs to be a string" [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [21:16:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2023.codfw.wmnet with OS trixie [21:16:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:16:29] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2023.codfw.wmnet with OS trixie [21:16:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1063.eqiad.wmnet with OS bookworm [21:16:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2024.codfw.wmnet with OS trixie [21:16:34] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835007 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1063.eqiad.wmnet with OS bookworm completed: - mc1... 
[21:16:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2024.codfw.wmnet with OS trixie [21:16:40] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835009 (10Jclark-ctr) [21:17:17] (03CR) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [21:17:27] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835011 (10Jclark-ctr) [21:18:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [21:21:48] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [21:23:06] (03CR) 10JHathaway: [C:03+1] ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [21:23:36] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1071.eqiad.wmnet with OS bookworm [21:23:42] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835020 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1071.eqiad.wmnet with OS bookworm [21:24:04] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1056.eqiad.wmnet with OS bookworm [21:24:09] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 
10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835025 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1056.eqiad.wmnet with OS bookworm executed with er... [21:27:26] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:28:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [21:30:32] jclark@cumin1003 reimage (PID 2544631) is awaiting input [21:32:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2021.codfw.wmnet with reason: host reimage [21:33:06] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:34:29] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [21:34:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:34:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1068.eqiad.wmnet with OS bookworm [21:35:04] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1068.eqiad.wmnet with OS bookworm completed: - mc1... 
[21:35:27] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835040 (10Jclark-ctr) [21:35:27] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:35:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2022.codfw.wmnet with reason: host reimage [21:35:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:35:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1069.eqiad.wmnet with OS bookworm [21:35:51] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835052 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1069.eqiad.wmnet with OS bookworm completed: - mc1... 
[21:36:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2023.codfw.wmnet with reason: host reimage [21:36:21] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835054 (10Jclark-ctr) [21:37:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2024.codfw.wmnet with reason: host reimage [21:39:21] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:39:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2021.codfw.wmnet with reason: host reimage [21:41:08] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:41:51] (03CR) 10Jcrespo: "Absolutely not, but the decom script may warn you incorrectly about references left behind or searching could cause false positives for th" [puppet] - 10https://gerrit.wikimedia.org/r/1273699 (https://phabricator.wikimedia.org/T422851) (owner: 10Jcrespo) [21:42:27] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:42:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2023.codfw.wmnet with reason: host reimage [21:45:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1065.eqiad.wmnet with OS bookworm [21:45:10] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1065.eqiad.wmnet with OS bookworm completed: - mc1... 
[21:47:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2022.codfw.wmnet with reason: host reimage [21:48:10] (03PS1) 10Ryan Kemper: cloudelastic1012: override common_settings merge to first [puppet] - 10https://gerrit.wikimedia.org/r/1274061 (https://phabricator.wikimedia.org/T422860) [21:48:45] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1274061 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:49:09] (03CR) 10Bking: [C:03+1] cloudelastic1012: override common_settings merge to first [puppet] - 10https://gerrit.wikimedia.org/r/1274061 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:51:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [21:51:19] (03PS1) 10Dzahn: ci::docker: do not try to install docker-cli on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1274067 (https://phabricator.wikimedia.org/T418109) [21:52:11] (03CR) 10Ryan Kemper: "PCC looks terrible, this overrode whole common_settings. 
fixing" [puppet] - 10https://gerrit.wikimedia.org/r/1274061 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:52:43] jclark@cumin1003 provision (PID 2561685) is awaiting input [21:53:06] (03CR) 10Catrope: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [21:54:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:54:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:54:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2021.codfw.wmnet with OS trixie [21:54:55] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835103 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2021.codfw.wmnet with OS trixie completed: - pc2021 (**PASS**) - Rem... 
[21:55:10] (03PS1) 10Ryan Kemper: cloudelastic1012: full common_settings override for OS2 [puppet] - 10https://gerrit.wikimedia.org/r/1274075 (https://phabricator.wikimedia.org/T422860) [21:55:56] (03Abandoned) 10Ryan Kemper: cloudelastic1012: full common_settings override for OS2 [puppet] - 10https://gerrit.wikimedia.org/r/1274075 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:56:40] (03PS2) 10Ryan Kemper: cloudelastic1012: full common_settings override for OS2 [puppet] - 10https://gerrit.wikimedia.org/r/1274061 (https://phabricator.wikimedia.org/T422860) [21:56:56] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1274061 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:57:52] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:58:51] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:59:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2024.codfw.wmnet with reason: host reimage [22:00:12] (03CR) 10Ryan Kemper: [C:03+2] "PCC looks correct now; merging" [puppet] - 10https://gerrit.wikimedia.org/r/1274061 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [22:00:37] (03PS2) 10Dzahn: ci::docker: only install docker-cli if on trixie or newer [puppet] - 10https://gerrit.wikimedia.org/r/1274067 (https://phabricator.wikimedia.org/T418109) [22:01:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:01:40] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
jhancock@cumin2002" [22:01:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2023.codfw.wmnet with OS trixie [22:01:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835127 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2023.codfw.wmnet with OS trixie completed: - pc2023 (**PASS**) - Rem... [22:02:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:02:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2022.codfw.wmnet with OS trixie [22:02:22] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2022.codfw.wmnet with OS trixie completed: - pc2022 (**PASS**) - Rem... 
[22:02:50] (03CR) 10Thcipriani: [C:03+1] "cherry pick seems to work on integration-docker-agent" [puppet] - 10https://gerrit.wikimedia.org/r/1274067 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [22:06:19] (03PS4) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) [22:06:44] (03CR) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [22:08:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:08:47] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [22:09:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [22:09:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1071.eqiad.wmnet with OS bookworm [22:09:07] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1071.eqiad.wmnet with OS bookworm completed: - mc1... 
[22:09:31] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1072.eqiad.wmnet with OS bookworm [22:09:40] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1072.eqiad.wmnet with OS bookworm [22:10:51] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835136 (10Jclark-ctr) [22:13:12] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1056.eqiad.wmnet with OS bookworm [22:13:16] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1056.eqiad.wmnet with OS bookworm [22:13:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:15:22] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1067.eqiad.wmnet with OS bookworm [22:15:25] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835141 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1067.eqiad.wmnet with OS bookworm executed with er... 
[22:16:42] jhancock@cumin2002 reimage (PID 2138794) is awaiting input [22:16:50] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:18:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:19:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:19:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:19:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2024.codfw.wmnet with OS trixie [22:19:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2024.codfw.wmnet with OS trixie completed: - pc2024 (**PASS**) - Rem... 
[22:20:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:20:15] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [22:20:40] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:20:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835155 (10Jhancock.wm) 05Open→03Resolved [22:21:05] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11835158 (10Jhancock.wm) @Marostegui these are complete! [22:23:20] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1070.eqiad.wmnet with OS bookworm [22:23:28] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1070.eqiad.wmnet with OS bookworm executed with er... 
[22:24:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [22:33:18] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1056.eqiad.wmnet with reason: host reimage [22:35:32] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1067.eqiad.wmnet with OS bookworm [22:35:41] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835182 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1067.eqiad.wmnet with OS bookworm [22:36:51] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1067.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:37:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:38:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:38:06] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc1067.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:38:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:38:22] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1067.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:38:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:39:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1056.eqiad.wmnet with 
reason: host reimage [22:40:34] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [22:40:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [22:40:51] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1072.eqiad.wmnet with OS bookworm [22:40:58] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835190 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1072.eqiad.wmnet with OS bookworm completed: - mc1... [22:41:21] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:42:50] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835191 (10Jclark-ctr) [22:45:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1067.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:48:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:48:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:49:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED
[22:49:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:49:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:50:35] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mc1070.eqiad.wmnet with OS bookworm
[22:50:40] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835200 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mc1070.eqiad.wmnet with OS bookworm
[22:51:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:54:10] (PS1) Ryan Kemper: cloudelastic1012: full common_settings override for OS2 [puppet] - https://gerrit.wikimedia.org/r/1274134 (https://phabricator.wikimedia.org/T422860)
[22:54:43] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1274134 (https://phabricator.wikimedia.org/T422860) (owner: Ryan Kemper)
[22:56:20] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[22:56:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[22:56:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1056.eqiad.wmnet with OS bookworm
[22:56:41] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835205 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1056.eqiad.wmnet with OS bookworm completed: - mc1...
[22:57:11] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835206 (Jclark-ctr)
[22:57:21] jclark@cumin1003 provision (PID 2567031) is awaiting input
[22:57:45] (CR) Ryan Kemper: "PCC shows -BindPaths=/usr/share/opensearch/plugins/ltr as desired; merging" [puppet] - https://gerrit.wikimedia.org/r/1274134 (https://phabricator.wikimedia.org/T422860) (owner: Ryan Kemper)
[22:57:46] (CR) Ryan Kemper: [C:+2] cloudelastic1012: full common_settings override for OS2 [puppet] - https://gerrit.wikimedia.org/r/1274134 (https://phabricator.wikimedia.org/T422860) (owner: Ryan Kemper)
[23:00:27] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mc1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:01:58] (PS1) Dzahn: etherpad: avoid adding warning message multiple times [puppet] - https://gerrit.wikimedia.org/r/1274138 (https://phabricator.wikimedia.org/T420793)
[23:04:12] (CR) Dzahn: [C:+2] etherpad: avoid adding warning message multiple times [puppet] - https://gerrit.wikimedia.org/r/1274138 (https://phabricator.wikimedia.org/T420793) (owner: Dzahn)
[23:05:26] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage
[23:07:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:08:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage
[23:10:23] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[23:10:27] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[23:21:33] (PS1) Dzahn: etherpad: remove code to insert warning message [puppet] - https://gerrit.wikimedia.org/r/1274151 (https://phabricator.wikimedia.org/T420793)
[23:22:49] (PS2) Dzahn: etherpad: remove code to insert warning message [puppet] - https://gerrit.wikimedia.org/r/1274151 (https://phabricator.wikimedia.org/T420793)
[23:23:10] (CR) Dzahn: [C:+2] etherpad: remove code to insert warning message [puppet] - https://gerrit.wikimedia.org/r/1274151 (https://phabricator.wikimedia.org/T420793) (owner: Dzahn)
[23:24:38] (CR) Dzahn: [C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1274151 (https://phabricator.wikimedia.org/T420793) (owner: Dzahn)
[23:24:49] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage
[23:25:56] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:25:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:25:59] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[23:25:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:26:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:26:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1067.eqiad.wmnet with OS bookworm
[23:26:25] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835283 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1067.eqiad.wmnet with OS bookworm completed: - mc1...
[23:26:41] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835284 (Jclark-ctr)
[23:31:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage
[23:40:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1274153
[23:40:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1274153 (owner: TrainBranchBot)
[23:48:13] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:51:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1274153 (owner: TrainBranchBot)
[23:51:19] jclark@cumin1003 reimage (PID 2575711) is awaiting input
[23:52:24] (PS1) Ecarg: Wikifunctions: add helm values for function-evaluator in Rust [deployment-charts] - https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627)
[23:53:05] (CR) Ecarg: Wikifunctions: add helm values for function-evaluator in Rust (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: Ecarg)
[23:55:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:55:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1070.eqiad.wmnet with OS bookworm
[23:55:15] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835321 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mc1070.eqiad.wmnet with OS bookworm completed: - mc1...
[23:55:31] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835324 (Jclark-ctr)
[23:56:09] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11835325 (Jclark-ctr) Open→Resolved @jijiki install is finished