[00:04:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:08:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:09:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:26:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:39:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:44:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:49:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:54:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:59:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:59:52] FIRING: [32x] CertAlmostExpired: 
Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:09:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1277283 [01:09:48] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1277283 (owner: 10TrainBranchBot) [01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:20:28] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1277283 (owner: 10TrainBranchBot) [01:23:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [01:23:44] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ... 
[01:23:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [01:24:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:29:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:34:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:39:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:43:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [01:43:44] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ... 
[01:43:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [01:44:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:00:40] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 6d 11h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [02:07:17] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 36s) [02:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:34:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:04:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:14:39] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:19:38] FIRING: [20x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:34:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:34:39] FIRING: [17x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:39:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:44:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:52:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:57:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:04:38] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:08:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:26:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:29:38] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:34:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:34:42] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:44:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired 
[04:49:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:54:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:54:34] (03CR) 10Marostegui: Revert "db1244: disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/1277235 (owner: 10Marostegui) [04:54:36] (03CR) 10Marostegui: [C:03+2] Revert "db1244: disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/1277235 (owner: 10Marostegui) [04:55:10] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1244: After hw issues [04:55:12] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1244: After hw issues [04:55:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1244: After hw issues [04:55:42] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11859189 (10Marostegui) Host being repooled. 
[04:59:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:09:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:12:20] (03PS1) 10Marostegui: db1200,db2178: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277288 (https://phabricator.wikimedia.org/T424323) [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:47] (03CR) 10Marostegui: [C:03+2] db1200,db2178: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277288 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [05:14:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1200.eqiad.wmnet with reason: Reimage to Trixie [05:14:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1200: Reimage to Trixie [05:14:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2178.codfw.wmnet with reason: Reimage to Trixie [05:14:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2178: Reimage to Trixie [05:14:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1200: Reimage to Trixie [05:15:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool 
(exit_code=0) depool db2178: Reimage to Trixie [05:15:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1200.eqiad.wmnet with OS trixie [05:16:00] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2178.codfw.wmnet with OS trixie [05:19:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:24:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:29:57] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage [05:34:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2178.codfw.wmnet with reason: host reimage [05:34:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:36:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage [05:40:11] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp6002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:40:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2178.codfw.wmnet with reason: host reimage [05:40:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1244: After hw issues [05:41:11] RECOVERY - Ensure trafficserver_exporter is 
running for instance backend on cp6002 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:44:39] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:46:22] (03PS1) 10Marostegui: Revert "db1200,db2178: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277289 [05:53:32] (03PS1) 10Abijeet Patro: TtmServer: Use lazyPush for job queue [extensions/Translate] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277290 (https://phabricator.wikimedia.org/T423779) [05:54:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Translate] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277290 (https://phabricator.wikimedia.org/T423779) (owner: 10Abijeet Patro) [05:56:58] (03CR) 10Marostegui: [C:03+2] Revert "db1200,db2178: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277289 (owner: 10Marostegui) [05:58:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1200.eqiad.wmnet with OS trixie [05:59:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:00:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1200: after reimage to trixie [06:03:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2178.codfw.wmnet with OS trixie [06:04:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:04:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 6d 7h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [06:05:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2178: after reimage to trixie [06:19:39] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:24:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:29:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:34:38] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:39:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:39:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:44:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is 
about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:44:39] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:46:04] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1277175 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [06:46:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1200: after reimage to trixie [06:49:19] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:49:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:50:35] (03PS6) 10Muehlenhoff: profile::zookeeper::firewall: Also allow passing a list of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1272766 [06:51:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2178: after reimage to trixie [06:54:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:57:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me, one doc suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [06:59:38] FIRING: [20x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:59:54] jouncebot: now [06:59:54] For the next 0 hour(s) 
and 0 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260426T0700) [07:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T0700). [07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] ah. now. [07:00:16] I'll be deploying abijeet's change. [07:00:20] abijeet: around? [07:01:36] (03CR) 10Muehlenhoff: [C:03+2] Apply ncredir role to ncredir5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1277051 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:03:08] kart_, im there [07:03:25] cool [07:03:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/Translate] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277290 (https://phabricator.wikimedia.org/T423779) (owner: 10Abijeet Patro) [07:04:39] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:04:59] (03Merged) 10jenkins-bot: TtmServer: Use lazyPush for job queue [extensions/Translate] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277290 (https://phabricator.wikimedia.org/T423779) (owner: 10Abijeet Patro) [07:05:52] (03CR) 10Filippo Giunchedi: [C:03+2] kubeadm: quote kubectl arguments [puppet] - 10https://gerrit.wikimedia.org/r/1277065 (https://phabricator.wikimedia.org/T420565) (owner: 10Filippo Giunchedi) [07:05:56] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1277290|TtmServer: Use lazyPush for job queue (T423779)]] 
[07:06:00] T423779: Translating a page on Meta-Wiki didn't create the translated page - https://phabricator.wikimedia.org/T423779 [07:09:24] (03Abandoned) 10Muehlenhoff: rsyslog/toil: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [07:09:39] FIRING: [24x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:14:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:22:11] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1277290|TtmServer: Use lazyPush for job queue (T423779)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:22:16] T423779: Translating a page on Meta-Wiki didn't create the translated page - https://phabricator.wikimedia.org/T423779 [07:22:40] (03CR) 10Elukey: [C:03+2] cfssl::cert: add require for csr when swapping intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277175 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [07:22:49] (03PS1) 10KartikMistry: cxserver: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277294 (https://phabricator.wikimedia.org/T423002) [07:23:07] !log kartik@deploy1003 abi, kartik: Continuing with deployment [07:23:30] Going ahead ^ as no manual testing is possible here. 
[07:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:34:49] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277290|TtmServer: Use lazyPush for job queue (T423779)]] (duration: 28m 53s) [07:34:53] T423779: Translating a page on Meta-Wiki didn't create the translated page - https://phabricator.wikimedia.org/T423779 [07:37:03] (03PS1) 10Marostegui: db1155.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277296 (https://phabricator.wikimedia.org/T423834) [07:38:08] (03CR) 10Marostegui: [C:03+2] db1155.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277296 (https://phabricator.wikimedia.org/T423834) (owner: 10Marostegui) [07:38:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 12 hosts with reason: Sanitarium: reimage to Debian Trixie [07:38:49] !log Reimage db1155 (sanitarium host) lag to be expected on wikireplicas: s2, s4, s6, s7 T423834 [07:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:53] T423834: Migrate sanitarium hosts to Debian Trixie - https://phabricator.wikimedia.org/T423834 [07:39:38] FIRING: [24x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:41:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1155.eqiad.wmnet with OS trixie [07:41:50] PROBLEM - MariaDB Replica IO: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 
repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:41:50] PROBLEM - MariaDB Replica IO: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3316 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:41:52] PROBLEM - MariaDB Replica IO: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3314 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:41:52] PROBLEM - MariaDB Replica IO: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:42:48] ^ expected [07:43:10] I downtimed clouddb* but not an-redacteddb, doing it now [07:43:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Sanitarium: reimage to Debian Trixie [07:44:51] (03PS1) 10Marostegui: db1207,db2171: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277297 (https://phabricator.wikimedia.org/T424323) [07:46:22] (03CR) 10Marostegui: [C:03+2] db1207,db2171: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1277297 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [07:46:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1207.eqiad.wmnet with reason: Reimage to Trixie [07:46:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1207: Reimage to Trixie [07:47:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1207.eqiad.wmnet with reason: Reimage to Trixie [07:47:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Reimage to Trixie [07:48:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2171: Reimage to Trixie [07:48:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2171: Reimage to Trixie [07:48:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1207: Reimage to Trixie [07:50:24] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1207.eqiad.wmnet with OS trixie [07:50:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2171.codfw.wmnet with OS trixie [07:52:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11859354 (10JMeybohm) @Jclark-ctr AIUI these hosts are currently racked in dedicated WMCS... 
[07:54:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:55:14] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1155.eqiad.wmnet with reason: host reimage [07:59:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:00:21] (03CR) 10JMeybohm: "TBH I would prefer this to use the service mesh for" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [08:01:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1155.eqiad.wmnet with reason: host reimage [08:03:48] (03PS1) 10Effie Mouzeli: mw-mcrouter: bump image and new config (eqiad+codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277300 (https://phabricator.wikimedia.org/T421360) [08:05:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1207.eqiad.wmnet with reason: host reimage [08:07:34] (03PS1) 10Muehlenhoff: Add ncredir5003/5004 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1277301 (https://phabricator.wikimedia.org/T421863) [08:07:55] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [08:08:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:09:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:10:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db1207.eqiad.wmnet with reason: host reimage [08:10:33] (03CR) 10Elukey: "Yep I had the same thought but I reasoned about the following:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [08:12:11] (03PS1) 10Muehlenhoff: Remove access for zoe [puppet] - 10https://gerrit.wikimedia.org/r/1277302 [08:12:35] (03PS2) 10Muehlenhoff: Remove access for zoe [puppet] - 10https://gerrit.wikimedia.org/r/1277302 [08:13:19] (03CR) 10CI reject: [V:04-1] Remove access for zoe [puppet] - 10https://gerrit.wikimedia.org/r/1277302 (owner: 10Muehlenhoff) [08:13:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [08:14:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:14:47] (03CR) 10Elukey: [C:03+1] "Everything looks good to me, I am not sure why the CI diff shows only the proxy changes though. In eqiad it seems as if the other paramete" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277300 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [08:14:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11859402 (10MatthewVernon) 05Resolved→03Open Hi folks - this host paged again on 2026-04-25 around 15:40 (and on-call extended the downtime on it). 
[08:14:52] !log test gnmic 0.45.0 on netflow4003 - T416360 [08:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:56] T416360: gNMIc: investigate new "collector" command - https://phabricator.wikimedia.org/T416360 [08:15:42] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11859406 (10MatthewVernon) 05Open→03Resolved [08:16:16] (03PS3) 10Muehlenhoff: Remove access for zoe [puppet] - 10https://gerrit.wikimedia.org/r/1277302 [08:19:24] (03PS1) 10Marostegui: Revert "db1155.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277316 [08:19:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:19:39] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:19:49] RECOVERY - MariaDB Replica IO: s6 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:19:49] RECOVERY - MariaDB Replica IO: s2 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:19:51] RECOVERY - MariaDB Replica IO: s4 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:19:51] RECOVERY - MariaDB Replica IO: s7 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:19:57] (03CR) 10Marostegui: [C:03+2] Revert "db1155.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277316 (owner: 10Marostegui) 
[08:20:27] (03PS1) 10Marostegui: Revert "db1207,db2171: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277325 [08:21:24] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1277302 (owner: 10Muehlenhoff) [08:22:39] (03PS2) 10Effie Mouzeli: (DNM) site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) [08:23:06] (03CR) 10Muehlenhoff: [C:03+2] Remove access for zoe [puppet] - 10https://gerrit.wikimedia.org/r/1277302 (owner: 10Muehlenhoff) [08:23:46] (03PS3) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) [08:24:29] (03CR) 10Marostegui: [C:03+2] Revert "db1207,db2171: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277325 (owner: 10Marostegui) [08:25:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1155.eqiad.wmnet with OS trixie [08:30:50] (03CR) 10Ayounsi: [C:03+1] Add ncredir5003/5004 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1277301 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [08:34:15] (03CR) 10Jelto: "I accidentally implemented the same patch in I96218fcff1a86228d149c112d928bd92aef8cdd8. 
Let's discuss later today how that downtime should" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277115 (https://phabricator.wikimedia.org/T424175) (owner: 10Arnaudb) [08:35:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1207.eqiad.wmnet with OS trixie [08:35:54] (03PS1) 10Kosta Harlan: wmgPrivilegedGroups/wmgPrivilegedGlobalGroups: Update to include temporary account IP viewers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277426 [08:36:19] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Zoe out of all services on: 2436 hosts [08:37:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1207: after reimage to trixie [08:37:16] (03CR) 10Arnaudb: "I think we can drop this change, yours is simpler :-)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277115 (https://phabricator.wikimedia.org/T424175) (owner: 10Arnaudb) [08:37:32] (03Abandoned) 10Arnaudb: gitlab: silence SystemdUnitFailed alert after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1277115 (https://phabricator.wikimedia.org/T424175) (owner: 10Arnaudb) [08:37:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11859538 (10MLechvien-WMF) > @Clement_Goubert i am having issues with these failing to image this is error on console. They might be missing a partman @jijiki could yo... 
[08:37:41] (03CR) 10Muehlenhoff: site.pp: add role for rdb2011 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [08:39:26] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 723825440 and 108 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:40:22] (03CR) 10Muehlenhoff: [C:03+2] Add ncredir5003/5004 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1277301 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [08:40:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2171.codfw.wmnet with OS trixie [08:41:55] (03PS4) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) [08:41:59] (03CR) 10Jelto: "alert looks good to me, thank you! I looked at the historic values in Thanos https://w.wiki/MH58 and the threshold looks reasonable. 
But t" [alerts] - 10https://gerrit.wikimedia.org/r/1277048 (https://phabricator.wikimedia.org/T423601) (owner: 10Arnaudb) [08:42:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [08:42:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91544 and previous config saved to /var/cache/conftool/dbconfig/20260427-084217-fceratto.json [08:42:25] (03CR) 10Mszwarc: [C:03+1] wmgPrivilegedGroups/wmgPrivilegedGlobalGroups: Update to include temporary account IP viewers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277426 (owner: 10Kosta Harlan) [08:42:25] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:42:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2171: after reimage to trixie [08:43:23] (03CR) 10JMeybohm: "Yeah, sorry. I wasn't properly thinking this through." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [08:43:26] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [08:43:34] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:43:34] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:43:36] (03PS5) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) [08:43:37] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [08:43:44] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:44:02] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:44:02] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2024.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2023.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:44:19] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:44:25] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes 
in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [08:44:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2024.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2023.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:44:26] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [08:44:26] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.406 second response time https://wikitech.wikimedia.org/wiki/Swift [08:44:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:44:34] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [08:44:39] FIRING: [20x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:44:44] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:44:49] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:44:49] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:44:52] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: 
HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:02] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:45:02] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:45:24] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:24] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:24] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:24] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:45:26] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3372392 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:45:34] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:45:34] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:45:34] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:34] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 
360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:36] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:41] (03CR) 10JMeybohm: [C:04-1] site.pp: add role for rdb2011 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [08:45:42] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 5.500 second response time https://wikitech.wikimedia.org/wiki/Swift [08:45:52] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.563 second response time https://wikitech.wikimedia.org/wiki/Swift [08:46:24] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [08:46:24] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [08:47:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [08:47:14] (03CR) 10Dpogorzelski: [C:03+1] Add gRPC port to kserve-inference NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276681 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [08:48:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:59] (03Abandoned) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [08:49:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:49:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... [08:49:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [08:51:22] (03CR) 10Bartosz Wójtowicz: [C:03+2] "Let's try in staging :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [08:51:42] (03CR) 10Bartosz Wójtowicz: [C:03+2] Add gRPC port to kserve-inference NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276681 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [08:52:24] (03CR) 10Klausman: [C:03+1] Add gRPC support to Istio ingress gateway for ML services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [08:52:35] (03CR) 10Klausman: [C:03+1] Add gRPC port to kserve-inference NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276681 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [08:54:26] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 718437360 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:55:43] (03PS1) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) [08:55:56] !incidents [08:55:57] 7873 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 
Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:56:20] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [08:57:25] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3354184 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:58:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2021.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2023.codfw.wmnet, ms-fe2022.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:58:35] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:58:35] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:58:47] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:58:47] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:59:10] (03Merged) 10jenkins-bot: Add gRPC support to Istio ingress gateway for ML services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [08:59:25] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [08:59:25] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 
OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [08:59:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:59:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:59:37] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [08:59:39] (03CR) 10Elukey: "Okok, this is along the lines of what I wanted to do (either that or call the ingress endpoint directly). IIUC this patch is ok to go, bas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [08:59:43] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 5.381 second response time https://wikitech.wikimedia.org/wiki/Swift [08:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:02:58] (03Merged) 10jenkins-bot: Add gRPC port to kserve-inference NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276681 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [09:03:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11859808 (10jijiki) >>! In T418916#11859537, @MLechvien-WMF wrote: >> @Clement_Goubert i am having issues with these failing to image this is error on console. They mig... 
[09:04:02] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Add sva to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277256 (https://phabricator.wikimedia.org/T407106) (owner: 10HakanIST) [09:04:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:04:39] FIRING: [20x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:09:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:10:04] (03PS1) 10Marostegui: db1154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277433 (https://phabricator.wikimedia.org/T423834) [09:10:18] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir5003.eqsin.wmnet [09:10:26] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir5004.eqsin.wmnet [09:10:53] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir5003.eqsin.wmnet [09:10:55] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:10:57] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir5004.eqsin.wmnet [09:11:09] !log Reimage db1154 (sanitarium host) lag to be expected on wikireplicas: s, s3, s5, s8 x3 T423834 [09:11:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 13 hosts with reason: Sanitarium: reimage to Debian Trixie [09:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:14] T423834: Migrate sanitarium hosts to Debian Trixie - https://phabricator.wikimedia.org/T423834 [09:11:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:11:36] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:11:58] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:12:25] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [09:12:35] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:12:35] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:13:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:13:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1154.eqiad.wmnet with OS trixie [09:13:25] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 
bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [09:13:25] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:14:39] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:16:58] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:19:39] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:20:32] (03PS1) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 
(https://phabricator.wikimedia.org/T424049) [09:20:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:20:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:21:25] (03CR) 10Marostegui: [C:03+2] db1154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277433 (https://phabricator.wikimedia.org/T423834) (owner: 10Marostegui) [09:22:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:22:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1207: after reimage to trixie [09:22:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:24:34] FIRING: [237x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:25:16] (03PS1) 10Elukey: envoyproxy: trigger the envoy's config re-creation if deleted [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) [09:25:26] (03PS2) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [09:25:38] (03PS1) 10Marostegui: Revert "db1154: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277439 [09:26:18] (03CR) 10JMeybohm: [C:03+1] "yeah, sorry for the noise." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [09:26:38] (03CR) 10JMeybohm: [C:03+1] charts: add ingress support to function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276873 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [09:27:41] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:27:59] (03CR) 10CI reject: [V:04-1] envoyproxy: trigger the envoy's config re-creation if deleted [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:28:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:28:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2171: after reimage to trixie [09:28:19] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1154.eqiad.wmnet with reason: host reimage [09:29:21] (03PS3) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [09:29:51] (03PS2) 10Elukey: envoyproxy: trigger the envoy's config re-creation if deleted [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) [09:30:39] (03PS3) 10Arnaudb: gerrit: predict_linear alert for diskspace [alerts] - 10https://gerrit.wikimedia.org/r/1277048 (https://phabricator.wikimedia.org/T423601) [09:34:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [09:35:20] (03PS4) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [09:35:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1154.eqiad.wmnet with reason: host reimage [09:38:45] FIRING: [2x] ProbeDown: Service wdqs2026:443 has failed probes (http_wdqs_internal_scholarly_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2026:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:39:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:39:33] (03CR) 10JMeybohm: [C:03+1] "We should ensure in staging that this does the right thing" [puppet] - 10https://gerrit.wikimedia.org/r/1272537 (https://phabricator.wikimedia.org/T365687) (owner: 10Muehlenhoff) [09:40:13] (03CR) 10Elukey: "To keep archives happy - we decided not to proceed in this way since it may become not super intuitive to know what's happening in the pup" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:40:22] (03Abandoned) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:40:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:40:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:41:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [09:41:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P91556 and previous config saved to /var/cache/conftool/dbconfig/20260427-094109-fceratto.json [09:41:23] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:41:41] (03CR) 10JMeybohm: [C:03+1] "LGTM modulo what @ltoscano@wikimedia.org said 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277300 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [09:42:38] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir5001.eqsin.wmnet [09:42:42] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir5002.eqsin.wmnet [09:42:48] (03PS2) 10Effie Mouzeli: mw-mcrouter: bump image and new config (eqiad+codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277300 (https://phabricator.wikimedia.org/T421360) [09:43:49] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [09:44:39] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:45:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:45:51] (03PS1) 10Michael Große: stats(CreateAccount): record baseline data for opening rates [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) [09:45:59] (03PS1) 10Michael Große: stats(CreateAccount): ignore overridden experiments [extensions/GrowthExperiments] 
(wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) [09:46:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [09:46:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [09:47:47] (03CR) 10Muehlenhoff: [C:03+2] Make doh5003/doh5004 wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1276656 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [09:48:18] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1159: Repooling [09:49:02] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1161: Repooling [09:49:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:49:40] (03PS3) 10Elukey: envoyproxy: trigger the envoy's config re-creation if deleted [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) [09:49:40] (03PS1) 10Elukey: Move netbox and presto to the new PKI intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277449 (https://phabricator.wikimedia.org/T420993) [09:52:33] (03CR) 10JMeybohm: [C:03+1] "Nice, LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [09:52:36] (03PS2) 10TheDJ: Implement remaining, more rare, orientations [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1270141 (https://phabricator.wikimedia.org/T424495) [09:52:37] (03CR) 10Marostegui: [C:03+2] Revert "db1154: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277439 (owner: 10Marostegui) [09:52:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277426 (owner: 10Kosta Harlan) [09:52:52] (03CR) 10JMeybohm: [C:04-1] Remove k8s version from all services (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [09:52:54] moritzm: can I merge your change? 
[09:53:14] !log jiji@deploy1003 Locking from deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie [09:53:28] marostegui: please do [09:53:36] moritzm: doing it [09:53:39] cheers [09:53:45] RESOLVED: [2x] ProbeDown: Service wdqs2026:443 has failed probes (http_wdqs_internal_scholarly_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2026:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:09] (03CR) 10CI reject: [V:04-1] stats(CreateAccount): record baseline data for opening rates [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [09:56:16] (03CR) 10Jforrester: [C:03+1] Implement remaining, more rare, orientations [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1270141 (https://phabricator.wikimedia.org/T424495) (owner: 10TheDJ) [09:56:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:57:37] (03CR) 10CI reject: [V:04-1] stats(CreateAccount): ignore overridden experiments [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [09:59:16] (03PS2) 10Elukey: Move netbox and presto to the new PKI intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277449 (https://phabricator.wikimedia.org/T420993) [09:59:16] (03PS4) 10Elukey: envoyproxy: trigger the envoy's config re-creation if deleted [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) [09:59:17] (03PS1) 10Elukey: 
profile::tlsproxy::envoy: add condition to cfss base options [puppet] - 10https://gerrit.wikimedia.org/r/1277452 (https://phabricator.wikimedia.org/T420993) [09:59:37] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:59:43] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [09:59:53] (03PS1) 10Brouberol: Remove mw-page-edit-type-enrich-next from flink tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277453 (https://phabricator.wikimedia.org/T424364) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1000) [10:00:05] effie: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[10:00:37] (03PS1) 10Brouberol: Remove the mw-page-edit-type-enrich-next kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1277454 (https://phabricator.wikimedia.org/T424364) [10:00:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1154.eqiad.wmnet with OS trixie [10:02:19] I have locked scap but I will hold off rollout the update till oncallers give me the go ahead [10:02:43] (03CR) 10AKhatun: [C:03+1] Remove mw-page-edit-type-enrich-next from flink tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277453 (https://phabricator.wikimedia.org/T424364) (owner: 10Brouberol) [10:02:59] (03CR) 10AKhatun: [C:03+1] Remove the mw-page-edit-type-enrich-next kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1277454 (https://phabricator.wikimedia.org/T424364) (owner: 10Brouberol) [10:03:37] (03CR) 10Brouberol: [C:03+2] Remove the mw-page-edit-type-enrich-next kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1277454 (https://phabricator.wikimedia.org/T424364) (owner: 10Brouberol) [10:03:38] (03PS1) 10Kosta Harlan: hCaptcha: enable for mobile apps account creation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277455 (https://phabricator.wikimedia.org/T412132) [10:04:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277455 (https://phabricator.wikimedia.org/T412132) (owner: 10Kosta Harlan) [10:04:37] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:04:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:04:45] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.39 ms [10:04:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [10:04:54] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 6d 3h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [10:09:22] (03PS1) 10Michael Große: bundlesize: unset mediaiwiki.base max size [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277456 (https://phabricator.wikimedia.org/T424324) [10:09:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:09:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277456 (https://phabricator.wikimedia.org/T424324) (owner: 10Michael Große) [10:10:41] (03CR) 10Brouberol: [C:03+2] Remove mw-page-edit-type-enrich-next from flink tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277453 (https://phabricator.wikimedia.org/T424364) (owner: 10Brouberol) [10:11:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks 
good" [puppet] - 10https://gerrit.wikimedia.org/r/1277452 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:12:00] (03CR) 10Elukey: [C:03+2] charts: add ingress support to function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276873 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [10:12:10] (03CR) 10Elukey: [C:03+2] services: enable ingress for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [10:13:07] (03PS2) 10Michael Große: stats(CreateAccount): ignore overridden experiments [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) [10:13:22] (03PS2) 10Michael Große: stats(CreateAccount): record baseline data for opening rates [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) [10:14:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [10:15:10] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [10:15:29] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1159: Repooling [10:15:30] !log bump OSPF metric of ulsfo-codfw transport to 750 [10:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:36] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1159: Repooling [10:15:38] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1161: Repooling [10:15:44] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1161: Repooling [10:15:45] !log 
fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1159: Repooling [10:15:48] (03PS1) 10Kosta Harlan: hCaptcha: Emit load_duration once per load and add load_attempts [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277457 (https://phabricator.wikimedia.org/T421204) [10:15:53] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1161: Repooling [10:16:31] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [10:16:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:17:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:19:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [10:21:42] (03PS2) 10Kosta Harlan: hCaptcha: Emit load_duration once per load and add load_attempts [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277457 (https://phabricator.wikimedia.org/T421204) [10:22:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277457 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [10:24:03] PROBLEM - Bird Internet Routing Daemon on doh5003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:24:16] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1185: Repooling [10:24:43] (03CR) 10CI reject: [V:04-1] stats(CreateAccount): ignore overridden experiments 
[extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [10:26:08] (03CR) 10CI reject: [V:04-1] stats(CreateAccount): record baseline data for opening rates [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [10:26:08] (03PS1) 10Muehlenhoff: ganeti: Make the cfssl label configurable via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1277458 (https://phabricator.wikimedia.org/T420993) [10:28:24] 06SRE, 06Infrastructure-Foundations, 06serviceops-deprecated, 07ARM support: Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811#11860323 (10zeljkofilipin) [10:29:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277458 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [10:31:25] (03CR) 10Mszwarc: [C:03+1] hCaptcha: Emit load_duration once per load and add load_attempts [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277457 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [10:33:26] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:33:26] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:26] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:26] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is 
about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:35:42] FIRING: CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:35:53] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1185: Repooling [10:36:16] (03PS1) 10Marostegui: db1210,db2157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277463 (https://phabricator.wikimedia.org/T424323) [10:36:18] (03CR) 10Elukey: [C:03+1] ganeti: Make the cfssl label configurable via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1277458 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [10:36:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:36:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91565 and previous config saved to /var/cache/conftool/dbconfig/20260427-103635-fceratto.json [10:36:40] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:37:09] (03CR) 10Marostegui: [C:03+2] db1210,db2157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277463 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [10:37:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1210.eqiad.wmnet with reason: Reimage to Trixie [10:37:45] !log marostegui@cumin1003 START - Cookbook 
sre.mysql.depool depool db1210: Reimage to Trixie [10:37:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2157.codfw.wmnet with reason: Reimage to Trixie [10:37:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2157: Reimage to Trixie [10:38:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2157: Reimage to Trixie [10:38:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1210: Reimage to Trixie [10:38:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91568 and previous config saved to /var/cache/conftool/dbconfig/20260427-103845-fceratto.json [10:39:37] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:39:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1210.eqiad.wmnet with OS trixie [10:39:47] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2157.codfw.wmnet with OS trixie [10:40:42] RESOLVED: CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:44:44] (03CR) 10Muehlenhoff: [C:03+2] ganeti: Make the cfssl label configurable via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1277458 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [10:48:26] (03CR) 10Ilias Sarantopoulos: [C:03+1] Fix gRPC Gateway 
protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [10:48:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91569 and previous config saved to /var/cache/conftool/dbconfig/20260427-104853-fceratto.json [10:49:58] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1277449 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:54:35] (03PS1) 10Atsuko: deployment_server: move charlie/admin_ng to debian package [puppet] - 10https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) [10:55:06] (03CR) 10CI reject: [V:04-1] deployment_server: move charlie/admin_ng to debian package [puppet] - 10https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: 10Atsuko) [10:55:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage [10:56:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:57:29] (03PS1) 10Cathal Mooney: Add pint ignore rules for CoreRouterInterfaceDropPercent [alerts] - 10https://gerrit.wikimedia.org/r/1277472 [10:58:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2157.codfw.wmnet with reason: host reimage [10:59:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91570 and previous config saved to /var/cache/conftool/dbconfig/20260427-105901-fceratto.json [10:59:25] (03PS1) 10Marostegui: Revert "db1210,db2157: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/1277474 [10:59:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... [10:59:51] Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F3%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:01:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage [11:04:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: host reimage [11:04:47] !incidents [11:04:47] 7873 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [11:04:47] 7874 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [11:09:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91571 and previous config saved to /var/cache/conftool/dbconfig/20260427-110909-fceratto.json [11:09:14] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:09:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:09:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:09:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T419635)', diff saved 
to https://phabricator.wikimedia.org/P91572 and previous config saved to /var/cache/conftool/dbconfig/20260427-110935-fceratto.json [11:10:40] (03PS1) 10Muehlenhoff: Move ganeti-test to the 2026 PKI discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277484 (https://phabricator.wikimedia.org/T420993) [11:11:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P91573 and previous config saved to /var/cache/conftool/dbconfig/20260427-111145-fceratto.json [11:13:26] (03PS1) 10Btullis: opensearch-cluster: Istio configure bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277486 (https://phabricator.wikimedia.org/T424007) [11:15:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277484 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [11:15:56] (03PS1) 10Muehlenhoff: doh on routed Ganeti/eqsin: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1277489 (https://phabricator.wikimedia.org/T421863) [11:16:07] (03PS2) 10Muehlenhoff: doh on routed Ganeti/eqsin: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1277489 (https://phabricator.wikimedia.org/T421863) [11:19:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:19:55] (03PS1) 10Esanders: Add experiment for suggestion mode beta feature [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277496 (https://phabricator.wikimedia.org/T422740) [11:20:11] (03PS1) 10Muehlenhoff: mediawiki::php: Fix version of php-common if ICU72 is enabled [puppet] - 10https://gerrit.wikimedia.org/r/1277497 (https://phabricator.wikimedia.org/T422964) [11:20:24] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277496 (https://phabricator.wikimedia.org/T422740) (owner: 10Esanders) [11:21:34] (03CR) 10Marostegui: [C:03+2] Revert "db1210,db2157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277474 (owner: 10Marostegui) [11:21:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P91574 and previous config saved to /var/cache/conftool/dbconfig/20260427-112153-fceratto.json [11:22:34] (03CR) 10Ayounsi: [C:03+1] doh on routed Ganeti/eqsin: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1277489 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [11:23:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... 
[11:23:51] Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F3%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:23:52] (03PS2) 10Cathal Mooney: Management routers: remove ospf conf and set asn [homer/public] - 10https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238) [11:24:00] !ack [11:24:02] 7875 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [11:24:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:25:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1210.eqiad.wmnet with OS trixie [11:25:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 988810392 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:26:12] (03CR) 10Cathal Mooney: Management routers: remove ospf conf and set asn (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238) (owner: 10Cathal Mooney) [11:27:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1210: after reimage to trixie [11:28:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2157.codfw.wmnet with OS trixie [11:30:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2157: after reimage to trixie [11:31:25] (03PS1) 10Michael Große: tests: Clear SiteNoticeAfter hook on SkinMinervaTest [skins/MinervaNeue] (wmf/1.46.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1277500 [11:31:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [skins/MinervaNeue] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277500 (owner: 10Michael Große) [11:32:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P91577 and previous config saved to /var/cache/conftool/dbconfig/20260427-113204-fceratto.json [11:32:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277497 (https://phabricator.wikimedia.org/T422964) (owner: 10Muehlenhoff) [11:32:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 24584 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:32:40] (03PS1) 10Jon Harald Søby: missing.php: Fix Wikiversity logo and improve dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277499 (https://phabricator.wikimedia.org/T424511) [11:33:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277499 (https://phabricator.wikimedia.org/T424511) (owner: 10Jon Harald Søby) [11:34:04] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:34:34] FIRING: [236x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:36:51] (03CR) 10Ayounsi: [C:03+1] Management routers: remove ospf conf and set asn [homer/public] - 10https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238) (owner: 10Cathal Mooney) [11:39:21] (03CR) 10Ayounsi: [C:03+2] doh on routed Ganeti/eqsin: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1277489 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [11:41:27] (03PS1) 10Marostegui: db2211: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277505 (https://phabricator.wikimedia.org/T424323) [11:42:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P91578 and previous config saved to /var/cache/conftool/dbconfig/20260427-114212-fceratto.json [11:42:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [11:42:19] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:42:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P91580 and previous config saved to /var/cache/conftool/dbconfig/20260427-114227-fceratto.json [11:43:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... 
[11:43:51] Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F3%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:43:54] (03CR) 10Elukey: [C:03+1] Move ganeti-test to the 2026 PKI discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277484 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [11:44:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P91581 and previous config saved to /var/cache/conftool/dbconfig/20260427-114437-fceratto.json [11:46:28] (03PS5) 10Elukey: envoyproxy: trigger the envoy's config re-creation if deleted [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) [11:46:28] (03PS2) 10Elukey: profile::tlsproxy::envoy: add condition to cfss base options [puppet] - 10https://gerrit.wikimedia.org/r/1277452 (https://phabricator.wikimedia.org/T420993) [11:46:28] (03PS3) 10Elukey: Move netbox and presto to the new PKI intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277449 (https://phabricator.wikimedia.org/T420993) [11:46:29] (03PS1) 10Elukey: profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) [11:46:31] (03PS1) 10Elukey: profile::puppetdb: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277507 (https://phabricator.wikimedia.org/T420993) [11:46:35] (03PS1) 10Elukey: profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) [11:46:39] (03PS1) 10Elukey: profile::hcaptcha: move to the discovery2026 pki 
intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) [11:46:43] (03PS1) 10Elukey: profile::dragonfly: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277510 (https://phabricator.wikimedia.org/T420993) [11:46:47] (03PS1) 10Elukey: profile::docker_registry: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) [11:46:51] (03PS1) 10Elukey: profile::cache::purge: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277512 (https://phabricator.wikimedia.org/T420993) [11:46:55] (03PS1) 10Elukey: profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) [11:48:20] RECOVERY - Bird Internet Routing Daemon on doh5003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:50:59] moritzm: ^ [11:54:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:54:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P91583 and previous config saved to /var/cache/conftool/dbconfig/20260427-115445-fceratto.json [11:56:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:58:14] (03CR) 10Brouberol: [C:03+1] Deploy the new airflow version to the main instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275857 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:21] (03CR) 10Brouberol: [C:03+1] 
Deploy the new Airflow version to the platform-eng instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275858 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:28] (03CR) 10Brouberol: [C:03+1] Deploy the new airflow version to the search instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275859 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:34] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the analytics-product instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275860 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:39] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275861 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:43] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the ml instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275862 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:49] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the sre instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275863 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:53] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the wmde instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275864 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:58:57] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the wikidata instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275865 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:59:12] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the fr-tech instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275866 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:59:19] FIRING: JobUnavailable: Reduced 
availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:59:22] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11860617 (A_smart_kitten) [12:03:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:04:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:04:19] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:04:39] FIRING: [20x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:04:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P91586 and previous config saved to /var/cache/conftool/dbconfig/20260427-120453-fceratto.json [12:05:23] (CR) Muehlenhoff: [C:+2] Move ganeti-test to the 2026 PKI discovery intermediate [puppet] - https://gerrit.wikimedia.org/r/1277484 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff) [12:06:57] (CR) Dpogorzelski: [V:+2 C:+2] kserve: update to version 0.17 (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski) [12:07:43] (CR) Dpogorzelski: "yes, helmfile destroy+sync on edit-check" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:08:05] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: remove excludeIPRanges from cni config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:09:39] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:09:47] (03Merged) 10jenkins-bot: ml-serve: remove excludeIPRanges from cni config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:10:29] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:10:36] (03PS1) 10Elukey: services: add TLS SANs to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) [12:12:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1210: after reimage to trixie [12:14:39] FIRING: [26x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:14:44] (03PS2) 10Elukey: services: change the default host to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) [12:15:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P91588 and previous config saved to /var/cache/conftool/dbconfig/20260427-121501-fceratto.json [12:15:09] T419635: Drop il_to column from imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T419635 [12:15:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:15:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P91589 and previous config saved to /var/cache/conftool/dbconfig/20260427-121529-fceratto.json [12:15:49] (03PS3) 10Elukey: services: Add TLS SANs to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) [12:15:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2157: after reimage to trixie [12:17:06] (03CR) 10Cathal Mooney: [C:03+2] Management routers: remove ospf conf and set asn [homer/public] - 10https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238) (owner: 10Cathal Mooney) [12:17:40] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2002.codfw.wmnet [12:18:44] (03Merged) 10jenkins-bot: Management routers: remove ospf conf and set asn [homer/public] - 10https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238) (owner: 10Cathal Mooney) [12:19:39] FIRING: [26x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:22:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P91591 and previous config saved to /var/cache/conftool/dbconfig/20260427-122213-fceratto.json [12:22:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:22:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:23:30] (03PS5) 10Dpogorzelski: amg-gpu: Set up explicit 
GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [12:23:58] (CR) CI reject: [V:-1] amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski) [12:24:28] (PS6) Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [12:24:40] (CR) Dpogorzelski: amg-gpu: Set up explicit GPU partitioning (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski) [12:24:54] (CR) CI reject: [V:-1] amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski) [12:26:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:27:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:27:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:27:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2002.codfw.wmnet [12:30:47] !log elukey@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[12:32:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P91592 and previous config saved to /var/cache/conftool/dbconfig/20260427-123221-fceratto.json [12:32:28] !log elukey@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:33:22] (03CR) 10Elukey: "The diff is a mess but I believe it is just a reordering issue.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [12:34:00] (03CR) 10JMeybohm: [C:03+1] profile::tlsproxy::envoy: add condition to cfss base options [puppet] - 10https://gerrit.wikimedia.org/r/1277452 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:34:39] FIRING: [24x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:35:01] (03CR) 10Marostegui: [C:03+2] db2211: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277505 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [12:35:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2211.codfw.wmnet with reason: Reimage to Trixie [12:35:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2211: Reimage to Trixie [12:35:39] !log elukey@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:35:45] !log elukey@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:35:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2211: Reimage to Trixie [12:37:14] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2211.codfw.wmnet with OS trixie [12:37:28] !log jiji@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie (duration: 164m 14s) [12:39:39] FIRING: [23x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:39:58] (03CR) 10EMcFarland: [C:03+1] "Would it be possible to include a test with this change? Thanks." [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275543 (https://phabricator.wikimedia.org/T422907) (owner: 10Jdlrobson) [12:41:06] (03CR) 10JMeybohm: [C:03+1] profile::dragonfly: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277510 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:41:34] (03PS7) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [12:41:42] (03CR) 10JMeybohm: [C:03+1] profile::docker_registry: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:42:02] (03CR) 10JMeybohm: [C:03+1] profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:42:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P91594 and previous config saved to /var/cache/conftool/dbconfig/20260427-124228-fceratto.json [12:43:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/1277507 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:46:13] (03CR) 10EMcFarland: [C:03+1] "Hi, please disregard the above comment; I missed that this change had already been merged to master and is targeting wmf/1.46.0-wmf.24." [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275543 (https://phabricator.wikimedia.org/T422907) (owner: 10Jdlrobson) [12:48:10] (03PS1) 10Marostegui: Revert "db2211: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277539 [12:48:26] (03PS3) 10Michael Große: stats(CreateAccount): record baseline data for opening rates [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) [12:48:43] (03CR) 10Thiemo Kreuz (WMDE): [C:03+2] tests: Clear SiteNoticeAfter hook on SkinMinervaTest [skins/MinervaNeue] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277500 (owner: 10Michael Große) [12:48:51] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] tests: Clear SiteNoticeAfter hook on SkinMinervaTest [skins/MinervaNeue] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277500 (owner: 10Michael Große) [12:49:24] (03PS3) 10Michael Große: stats(CreateAccount): ignore overridden experiments [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) [12:49:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1277449 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:50:57] (03CR) 10Arnaudb: [C:03+2] gerrit: update sync-instances cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [12:52:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P91595 and previous config saved 
to /var/cache/conftool/dbconfig/20260427-125236-fceratto.json [12:52:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:52:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:53:02] (03CR) 10Elukey: [C:03+2] envoyproxy: trigger the envoy's config re-creation if deleted [puppet] - 10https://gerrit.wikimedia.org/r/1277438 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:53:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P91596 and previous config saved to /var/cache/conftool/dbconfig/20260427-125301-fceratto.json [12:54:08] (03Merged) 10jenkins-bot: gerrit: update sync-instances cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [12:54:39] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:55:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2211.codfw.wmnet with reason: host reimage [12:56:39] (03CR) 10Brouberol: "Looks great, with one small nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277486 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [12:57:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:57:36] (03CR) 10Brouberol: [C:03+1] mediawiki::php: Fix version of php-common if ICU72 is enabled [puppet] - 10https://gerrit.wikimedia.org/r/1277497 (https://phabricator.wikimedia.org/T422964) (owner: 10Muehlenhoff) 
[12:58:07] (03PS2) 10Atsuko: deployment_server: move charlie/admin_ng to debian package [puppet] - 10https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) [12:58:34] (03CR) 10Brouberol: [C:03+1] "LG, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1277449 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:58:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P91597 and previous config saved to /var/cache/conftool/dbconfig/20260427-125834-fceratto.json [12:58:39] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:59:34] FIRING: [235x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:59:39] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:59:55] * Lucas_WMDE glares at the deployment calendar [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1300). [13:00:05] MichaelG_WMF, kostajh, edsanders, and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] here [13:00:10] o/ [13:00:25] hi [13:00:54] o/ [13:01:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2211.codfw.wmnet with reason: host reimage [13:01:02] the core and Minerva change of mine are just about tests. 
They are prerequisites for the Growth changes to pass CI [13:01:16] I can deploy – let’s start with Jhs [13:01:33] :D [13:01:36] hi [13:01:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277499 (https://phabricator.wikimedia.org/T424511) (owner: 10Jon Harald Søby) [13:02:11] https://gerrit.wikimedia.org/r/c/1277457/ this can be synced with another wmf.24 patch [13:02:18] It doesn’t need to be tested on its own [13:02:25] I hope missing.php works with X-Wikimedia-Debug [13:02:29] but it sounds like it should, if it’s inside multiversion/ [13:02:47] surely the XWD routing happens before we hit any PHP inside the container image [13:02:53] (03Merged) 10jenkins-bot: missing.php: Fix Wikiversity logo and improve dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277499 (https://phabricator.wikimedia.org/T424511) (owner: 10Jon Harald Søby) [13:03:13] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1277499|missing.php: Fix Wikiversity logo and improve dark mode (T424511 T424512)]] [13:03:19] T424511: Broken logo thumbnail for missing Wikiversities - https://phabricator.wikimedia.org/T424511 [13:03:20] T424512: missing.php should have better dark mode support - https://phabricator.wikimedia.org/T424512 [13:03:23] The config patches I have can also be bundled with other config patches [13:04:34] FIRING: [235x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:05:29] Lucas_WMDE, looks good on mwdebug 👍 [13:06:21] (03CR) 10Jcrespo: "This is ok, but I would like to merge it while the service is under maintenance because the latency of what I mentioned on the ticket. 
sto" [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:06:21] Jhs: please don’t test before scap says so, it’s confusing :P [13:06:30] haha, sorry :P [13:06:56] !log lucaswerkmeister-wmde@deploy1003 jhsoby, lucaswerkmeister-wmde: Backport for [[gerrit:1277499|missing.php: Fix Wikiversity logo and improve dark mode (T424511 T424512)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:48] !log lucaswerkmeister-wmde@deploy1003 jhsoby, lucaswerkmeister-wmde: Continuing with deployment [13:08:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:08:24] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:08:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [13:08:27] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277456 (https://phabricator.wikimedia.org/T424324) (owner: 10Michael Große) [13:08:30] (03CR) 10Btullis: [C:03+2] Deploy the new airflow version to the main instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275857 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [13:08:31] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [skins/MinervaNeue] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277500 (owner: 10Michael Große) [13:08:32] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T419961)', diff saved to https://phabricator.wikimedia.org/P91598 and previous config saved to /var/cache/conftool/dbconfig/20260427-130832-fceratto.json [13:08:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P91599 and previous config saved to /var/cache/conftool/dbconfig/20260427-130842-fceratto.json [13:09:39] FIRING: [23x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:10:01] !log updating debdeploy on bullseye to 0.0.99.15 [13:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:32] (03Merged) 10jenkins-bot: Deploy the new airflow version to the main instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275857 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [13:12:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:12:20] random SpiderPig thought (cc dancy): for messages like “13:10:25 Waiting 20 seconds for canary traffic” a clock somewhere on screen would be useful 🤔 [13:12:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:12:50] getting the server time every second is probably overkill, but maybe SpiderPig could check if my system clock is reasonably in sync with the deployment server (modulo time zone) at page load time… [13:12:57] (03Merged) 10jenkins-bot: bundlesize: unset mediaiwiki.base max size [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277456 (https://phabricator.wikimedia.org/T424324) (owner: 10Michael Große) [13:13:16] (03CR) 10Marostegui: [C:03+2] Revert "db2211: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277539 (owner: 
10Marostegui) [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:41] should I make a phab task for that or is it a stupid idea? ^^ [13:13:52] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277499|missing.php: Fix Wikiversity logo and improve dark mode (T424511 T424512)]] (duration: 10m 39s) [13:13:59] T424511: Broken logo thumbnail for missing Wikiversities - https://phabricator.wikimedia.org/T424511 [13:13:59] T424512: missing.php should have better dark mode support - https://phabricator.wikimedia.org/T424512 [13:14:12] deploying MichaelG_WMF’s changes next [13:14:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:14:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:14:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277500 (owner: 10Michael Große) [13:14:20] * MichaelG_WMF is ready to test [13:14:27] (as far as this can be tested) [13:14:34] FIRING: [235x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:14:39] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - 
https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:15:07] Lucas_WMDE: I can self deploy when it's my turn [13:15:15] (03PS2) 10Btullis: opensearch-cluster: Istio configure bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277486 (https://phabricator.wikimedia.org/T424007) [13:15:16] ack [13:15:30] kostajh: how about you, do you want to self-deploy or do you need a deployer? [13:16:42] I can deploy myself [13:17:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T419961)', diff saved to https://phabricator.wikimedia.org/P91600 and previous config saved to /var/cache/conftool/dbconfig/20260427-131702-fceratto.json [13:18:05] ok [13:18:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P91601 and previous config saved to /var/cache/conftool/dbconfig/20260427-131849-fceratto.json [13:19:09] (03Merged) 10jenkins-bot: tests: Clear SiteNoticeAfter hook on SkinMinervaTest [skins/MinervaNeue] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277500 (owner: 10Michael Große) [13:19:10] (03Merged) 10jenkins-bot: stats(CreateAccount): ignore overridden experiments [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277446 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:19:12] !log installing Bind security updates (client-side tools/libs) [13:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:34] FIRING: [235x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:20:51] (03PS5) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [13:21:00] (03PS3) 10Brouberol: admin: add suecarmol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) (owner: 10Gehel) [13:21:58] (03CR) 10Jelto: [C:03+1] "lgtm, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1277048 (https://phabricator.wikimedia.org/T423601) (owner: 10Arnaudb) [13:22:56] (03Merged) 10jenkins-bot: stats(CreateAccount): record baseline data for opening rates [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277445 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:23:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:23:08] (03CR) 10Elukey: [C:03+2] profile::tlsproxy::envoy: add condition to cfss base options [puppet] - 10https://gerrit.wikimedia.org/r/1277452 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:23:13] (03CR) 10Brouberol: [C:03+1] "LG!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1277486 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [13:23:15] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1277445|stats(CreateAccount): record baseline data for opening rates (T419916)]], [[gerrit:1277446|stats(CreateAccount): ignore overridden experiments (T419916)]], [[gerrit:1277456|bundlesize: unset mediaiwiki.base max size (T424324)]], [[gerrit:1277500|tests: Clear SiteNoticeAfter hook on SkinMinervaTest]] [13:23:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2211.codfw.wmnet with OS trixie [13:23:21] T419916: [V1 experiment release] Redesign mobile web account creation form following Codex guidelines - https://phabricator.wikimedia.org/T419916 [13:23:22] T424324: MediaWiki\Tests\Structure\BundleSizeTest::testBundleSize with data set "mediawiki.base" (array('N/A', 'mediawiki.base', '17.0 kB')) - https://phabricator.wikimedia.org/T424324 [13:23:25] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) (owner: 10Gehel) [13:23:29] (03CR) 10Arnaudb: [C:03+2] gerrit: predict_linear alert for diskspace (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1277048 (https://phabricator.wikimedia.org/T423601) (owner: 10Arnaudb) [13:24:22] (03CR) 10Brouberol: [C:03+2] admin: add suecarmol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) (owner: 10Gehel) [13:24:41] (03PS6) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [13:24:51] !log lucaswerkmeister-wmde@deploy1003 migr, lucaswerkmeister-wmde: Backport for [[gerrit:1277445|stats(CreateAccount): record baseline data for opening rates (T419916)]], [[gerrit:1277446|stats(CreateAccount): ignore overridden experiments (T419916)]], [[gerrit:1277456|bundlesize: unset mediaiwiki.base max size (T424324)]], [[gerrit:1277500|tests: Clear SiteNoticeAfter hook on SkinMinervaTest]] synced to the testservers ( [13:24:51] see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:24:56] (03Merged) 10jenkins-bot: gerrit: predict_linear alert for diskspace [alerts] - 10https://gerrit.wikimedia.org/r/1277048 (https://phabricator.wikimedia.org/T423601) (owner: 10Arnaudb) [13:25:04] * MichaelG_WMF is looking [13:25:37] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2211: after reimage to trixie [13:26:56] (03CR) 10Ssingh: [C:03+1] "Thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:27:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P91603 and previous config saved to /var/cache/conftool/dbconfig/20260427-132710-fceratto.json [13:27:40] (03CR) 10Elukey: "Tobias worked on a script on Friday to detect when the /dev/dri devices come up, since it may take up to a couple of minutes after udev st" [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [13:28:35] Lucas_WMDE: I'm not seeing any errors when creating a new account, so I think we're good to move forward 👍 [13:28:44] !log lucaswerkmeister-wmde@deploy1003 migr, lucaswerkmeister-wmde: Continuing with deployment [13:28:46] alright, thanks! [13:28:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P91604 and previous config saved to /var/cache/conftool/dbconfig/20260427-132857-fceratto.json [13:29:01] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:29:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [13:29:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1210 (T419635)', diff saved to https://phabricator.wikimedia.org/P91605 and previous config saved to /var/cache/conftool/dbconfig/20260427-132921-fceratto.json [13:30:24] (03CR) 10Elukey: [C:03+2] Move netbox and presto to the new PKI intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277449 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:32:56] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277445|stats(CreateAccount): record baseline data for 
opening rates (T419916)]], [[gerrit:1277446|stats(CreateAccount): ignore overridden experiments (T419916)]], [[gerrit:1277456|bundlesize: unset mediaiwiki.base max size (T424324)]], [[gerrit:1277500|tests: Clear SiteNoticeAfter hook on SkinMinervaTest]] (duration: 09m 40s) [13:33:02] T419916: [V1 experiment release] Redesign mobile web account creation form following Codex guidelines - https://phabricator.wikimedia.org/T419916 [13:33:02] T424324: MediaWiki\Tests\Structure\BundleSizeTest::testBundleSize with data set "mediawiki.base" (array('N/A', 'mediawiki.base', '17.0 kB')) - https://phabricator.wikimedia.org/T424324 [13:34:34] FIRING: [235x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:35:24] kostajh: over to you [13:35:42] and maybe edsanders if you want to deploy those changes together? [13:35:58] thanks [13:37:12] I'll just do one as I'm guiding our new engineer through the process [13:37:18] yeah, go ahead I think [13:37:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260427-133719-fceratto.json [13:37:50] hopefully WikimediaEvents gate-and-submit doesn’t take too long [13:37:55] Ok, let me know when you’re done [13:38:13] (03PS2) 10Elukey: profile::puppetdb: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277507 (https://phabricator.wikimedia.org/T420993) [13:38:13] (03PS2) 10Elukey: profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) [13:38:14] (03PS2) 10Elukey: profile::hcaptcha: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) [13:38:14] (03PS2) 10Elukey: profile::dragonfly: 
move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277510 (https://phabricator.wikimedia.org/T420993) [13:38:15] (03PS2) 10Elukey: profile::docker_registry: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) [13:38:16] (03PS2) 10Elukey: profile::cache::purge: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277512 (https://phabricator.wikimedia.org/T420993) [13:38:21] (03PS2) 10Elukey: profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) [13:38:25] (03PS2) 10Elukey: profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) [13:38:33] (03CR) 10Elukey: "Ack yes, moved it to the end of the queue, feel free to do it anytime! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:39:41] !log aokoth@cumin1003 START - Cookbook sre.hosts.reimage for host phab2003.codfw.wmnet with OS bullseye [13:40:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277496 (https://phabricator.wikimedia.org/T422740) (owner: 10Esanders) [13:40:16] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1277550 (https://phabricator.wikimedia.org/T424521) [13:40:31] (03CR) 10Elukey: [C:03+2] profile::puppetdb: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277507 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:41:17] edsanders: please let me know when you’re done [13:41:58] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1201 to s6 master [puppet] 
- 10https://gerrit.wikimedia.org/r/1277551 (https://phabricator.wikimedia.org/T424522) [13:42:04] (03PS1) 10Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1277552 (https://phabricator.wikimedia.org/T424522) [13:42:28] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the platform-eng instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275858 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [13:42:39] (03CR) 10Btullis: [C:03+2] Deploy the new airflow version to the search instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275859 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [13:43:14] (03PS1) 10MVernon: swift: remove 3 drained eqiad backends for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1277553 (https://phabricator.wikimedia.org/T421719) [13:44:34] FIRING: [235x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:45:09] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the platform-eng instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275858 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [13:45:12] (03Merged) 10jenkins-bot: Deploy the new airflow version to the search instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275859 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [13:45:15] (03CR) 10Marostegui: [C:03+1] swift: remove 3 drained eqiad backends for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1277553 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [13:46:25] (03Merged) 10jenkins-bot: Add experiment for suggestion mode beta feature [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277496 (https://phabricator.wikimedia.org/T422740) (owner: 10Esanders) [13:46:42] !log esanders@deploy1003 Started scap sync-world: 
Backport for [[gerrit:1277496|Add experiment for suggestion mode beta feature (T422740)]] [13:46:46] T422740: Implement Suggestion Mode instrumentation spec in Test Kitchen - https://phabricator.wikimedia.org/T422740 [13:47:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T419961)', diff saved to https://phabricator.wikimedia.org/P91607 and previous config saved to /var/cache/conftool/dbconfig/20260427-134732-fceratto.json [13:47:41] (03CR) 10MVernon: [C:03+2] swift: remove 3 drained eqiad backends for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1277553 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [13:47:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [13:48:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T419961)', diff saved to https://phabricator.wikimedia.org/P91608 and previous config saved to /var/cache/conftool/dbconfig/20260427-134800-fceratto.json [13:48:15] !log esanders@deploy1003 esanders: Backport for [[gerrit:1277496|Add experiment for suggestion mode beta feature (T422740)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:49:39] kostajh: IMHO you can already +2 your backport to start the gate-and-submit there [13:49:51] (while edsanders tests the experiment) [13:50:23] !log esanders@deploy1003 esanders: Continuing with deployment [13:50:26] looks good [13:52:30] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:53:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:54:08] !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277496|Add experiment for suggestion mode beta feature (T422740)]] (duration: 07m 26s) [13:54:13] T422740: Implement Suggestion Mode instrumentation spec in Test Kitchen - https://phabricator.wikimedia.org/T422740 [13:54:22] jouncebot: nowandnext [13:54:22] For the next 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1300) [13:54:22] In 0 hour(s) and 35 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1430) [13:54:34] FIRING: [230x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:54:52] edsanders: all done? 
[13:54:55] all done [13:55:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277457 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [13:55:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:55:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:56:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T419961)', diff saved to https://phabricator.wikimedia.org/P91610 and previous config saved to /var/cache/conftool/dbconfig/20260427-135622-fceratto.json [13:56:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:40] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [13:57:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:57:22] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [13:59:10] (03PS1) 10Brouberol: kafka-jumbo: update kafka-jumbo1010 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277555 (https://phabricator.wikimedia.org/T300102) [14:00:34] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8472/co" [puppet] - 10https://gerrit.wikimedia.org/r/1277555 
(https://phabricator.wikimedia.org/T300102) (owner: 10Brouberol) [14:01:39] (03CR) 10Elukey: [C:03+1] "gogogogogo" [puppet] - 10https://gerrit.wikimedia.org/r/1277555 (https://phabricator.wikimedia.org/T300102) (owner: 10Brouberol) [14:02:36] (03CR) 10Btullis: [C:03+1] kafka-jumbo: update kafka-jumbo1010 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277555 (https://phabricator.wikimedia.org/T300102) (owner: 10Brouberol) [14:02:48] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 10observability, and 4 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#11861150 (10brouberol) a:03brouberol [14:03:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [14:03:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91611 and previous config saved to /var/cache/conftool/dbconfig/20260427-140309-fceratto.json [14:03:14] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:03:20] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: NodeTextfileStale (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T424001#11861169 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff Will soon be fixed by merging https://gerrit.wikimedia.org/r/c/... 
[14:03:38] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the analytics-product instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275860 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:03:44] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275861 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:04:08] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1159: Repooling [14:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 5d 23h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [14:06:14] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the analytics-product instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275860 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:06:18] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275861 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:06:22] (03Merged) 10jenkins-bot: hCaptcha: Emit load_duration once per load and add load_attempts [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277457 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [14:06:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P91613 and previous config saved to /var/cache/conftool/dbconfig/20260427-140630-fceratto.json [14:06:37] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1277457|hCaptcha: Emit load_duration once per load and add load_attempts (T421204)]] [14:07:01] (03CR) 10Scott French: [C:03+1] "Thanks for catching this, Moritz!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1277497 (https://phabricator.wikimedia.org/T422964) (owner: 10Muehlenhoff) [14:07:09] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade Kafka to version 3.x - https://phabricator.wikimedia.org/T416669#11861218 (10brouberol) [14:08:13] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1277457|hCaptcha: Emit load_duration once per load and add load_attempts (T421204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:08:35] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:08:59] (03PS1) 10Brouberol: kafka-jumbo: update kafka-jumbo1010 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277558 (https://phabricator.wikimedia.org/T300102) [14:09:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [14:09:31] !log kharlan@deploy1003 kharlan: Continuing with deployment [14:09:31] (03CR) 10CI reject: [V:04-1] kafka-jumbo: update kafka-jumbo1010 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277558 (https://phabricator.wikimedia.org/T300102) (owner: 10Brouberol) [14:09:45] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:10:05] (03PS2) 10Brouberol: kafka-jumbo: update kafka-jumbo1010 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277555 (https://phabricator.wikimedia.org/T300102) [14:10:13] (03Abandoned) 10Brouberol: kafka-jumbo: update kafka-jumbo1010 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277558 (https://phabricator.wikimedia.org/T300102) (owner: 10Brouberol) [14:10:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:10:39] (03CR) 10Btullis: [C:03+2] opensearch-cluster: Istio 
configure bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277486 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [14:10:41] (03CR) 10Ayounsi: [C:03+1] netbox: Remove backups from netbox server, only leave postgres ones [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) (owner: 10Jcrespo) [14:10:51] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the ml instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275862 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:10:56] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the sre instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275863 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:11:01] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the wmde instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275864 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:11:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2211: after reimage to trixie [14:11:06] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the wikidata instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275865 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:11:10] (03CR) 10Brouberol: [C:03+2] kafka-jumbo: update kafka-jumbo1010 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277555 (https://phabricator.wikimedia.org/T300102) (owner: 10Brouberol) [14:11:11] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the fr-tech instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275866 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:12:16] (03CR) 10Ayounsi: [C:03+1] "lgtm! 
thx" [cookbooks] - 10https://gerrit.wikimedia.org/r/1269011 (owner: 10Elukey) [14:12:48] !log btullis@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: dse-k8s-master-codfw@codfw [14:13:21] (03Merged) 10jenkins-bot: opensearch-cluster: Istio configure bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277486 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [14:13:23] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277457|hCaptcha: Emit load_duration once per load and add load_attempts (T421204)]] (duration: 06m 46s) [14:13:25] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the ml instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275862 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:13:36] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the sre instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275863 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:13:55] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the wmde instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275864 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:14:00] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the wikidata instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275865 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:14:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [14:14:20] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA, 13Patch-For-Review: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11861256 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002... 
[14:14:23] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the fr-tech instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275866 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [14:14:25] (03CR) 10Scott French: [C:03+1] profile::docker_registry: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:14:34] FIRING: [223x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:14:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be1091 [14:14:52] (03CR) 10Scott French: [C:03+1] profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:14:52] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [14:14:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277426 (owner: 10Kosta Harlan) [14:14:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277455 (https://phabricator.wikimedia.org/T412132) (owner: 10Kosta Harlan) [14:15:36] (03PS2) 10Kosta Harlan: hCaptcha: enable for mobile apps account creation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277455 (https://phabricator.wikimedia.org/T412132) [14:15:48] (03CR) 10Kosta Harlan: [C:03+2] hCaptcha: enable for mobile apps account creation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277455 (https://phabricator.wikimedia.org/T412132) (owner: 10Kosta Harlan) [14:15:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277455 
(https://phabricator.wikimedia.org/T412132) (owner: 10Kosta Harlan) [14:16:17] (03Merged) 10jenkins-bot: wmgPrivilegedGroups/wmgPrivilegedGlobalGroups: Update to include temporary account IP viewers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277426 (owner: 10Kosta Harlan) [14:16:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [14:16:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P91615 and previous config saved to /var/cache/conftool/dbconfig/20260427-141638-fceratto.json [14:17:07] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [14:17:20] (03Merged) 10jenkins-bot: hCaptcha: enable for mobile apps account creation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277455 (https://phabricator.wikimedia.org/T412132) (owner: 10Kosta Harlan) [14:17:29] !log btullis@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for alias: dse-k8s-master-codfw@codfw [14:17:29] 06SRE, 06Traffic: Investigate port 80 page in text@esams for Ipv6 - https://phabricator.wikimedia.org/T423667#11861274 (10jasmine_) [14:17:33] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1277426|wmgPrivilegedGroups/wmgPrivilegedGlobalGroups: Update to include temporary account IP viewers]], [[gerrit:1277455|hCaptcha: enable for mobile apps account creation on testwiki (T412132)]] [14:17:41] T412132: Integrate HCaptcha into Account Creation flow - https://phabricator.wikimedia.org/T412132 [14:17:44] !log btullis@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: dse-k8s-master-codfw@codfw [14:18:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [14:18:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-ml: apply [14:18:45] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:19:13] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1277426|wmgPrivilegedGroups/wmgPrivilegedGlobalGroups: Update to include temporary account IP viewers]], [[gerrit:1277455|hCaptcha: enable for mobile apps account creation on testwiki (T412132)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:19:28] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1159: Repooling [14:19:34] FIRING: [223x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:19:39] FIRING: [23x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:20:07] !log kharlan@deploy1003 kharlan: Continuing with deployment [14:20:30] mvernon@cumin2002 reimage (PID 2735888) is awaiting input [14:20:30] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:20:35] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:21:08] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:22:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [14:23:17] btullis@cumin1003 migrate-service-ipip (PID 1396956) is awaiting input [14:23:21] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1091 - mvernon@cumin2002" [14:23:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [14:23:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:23:57] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277426|wmgPrivilegedGroups/wmgPrivilegedGlobalGroups: Update to include temporary account IP viewers]], [[gerrit:1277455|hCaptcha: enable for mobile apps account creation on testwiki (T412132)]] (duration: 06m 23s) [14:24:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1091 - mvernon@cumin2002" [14:24:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:24:01] T412132: Integrate HCaptcha into Account Creation flow - https://phabricator.wikimedia.org/T412132 [14:24:02] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be1091.eqiad.wmnet 21.48.64.10.in-addr.arpa 1.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:24:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be1091.eqiad.wmnet 21.48.64.10.in-addr.arpa 1.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:24:06] !log mvernon@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host ms-be1091 [14:24:18] !log btullis@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:24:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:24:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1091 [14:24:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be1091 [14:25:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:25:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: dse-k8s-master-codfw@codfw [14:26:00] (03PS3) 10Elukey: profile::hcaptcha: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) [14:26:00] (03PS3) 10Elukey: profile::dragonfly: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277510 (https://phabricator.wikimedia.org/T420993) [14:26:00] (03PS3) 10Elukey: profile::docker_registry: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) [14:26:01] (03PS3) 10Elukey: profile::cache::purge: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277512 (https://phabricator.wikimedia.org/T420993) [14:26:02] (03PS3) 10Elukey: profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) [14:26:03] (03PS3) 10Elukey: profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 
10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) [14:26:07] (03PS3) 10Elukey: profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) [14:26:18] Done with syncing patches [14:26:26] (03CR) 10Jcrespo: [C:03+2] netbox: Remove backups from netbox server, only leave postgres ones [puppet] - 10https://gerrit.wikimedia.org/r/1273791 (https://phabricator.wikimedia.org/T423689) (owner: 10Jcrespo) [14:26:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T419961)', diff saved to https://phabricator.wikimedia.org/P91617 and previous config saved to /var/cache/conftool/dbconfig/20260427-142646-fceratto.json [14:27:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [14:27:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T419961)', diff saved to https://phabricator.wikimedia.org/P91618 and previous config saved to /var/cache/conftool/dbconfig/20260427-142714-fceratto.json [14:27:44] (03CR) 10Jcrespo: "Assuming he is ok, we waited for a couple of weeks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo) [14:27:47] (03CR) 10Jcrespo: [C:03+2] backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo) [14:28:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [14:29:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [14:29:31] !log UTC afternoon backport+config window done [14:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:49] aokoth@cumin1003 reimage (PID 1369656) is awaiting input [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1430) [14:33:43] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Allow to easily disable puppet-merges temporarily - https://phabricator.wikimedia.org/T423121#11861407 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [14:35:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T419961)', diff saved to https://phabricator.wikimedia.org/P91619 and previous config saved to /var/cache/conftool/dbconfig/20260427-143523-fceratto.json [14:35:30] 06SRE, 06SRE Observability: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11861436 (10MoritzMuehlenhoff) [14:37:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [14:39:12] (03PS1) 10Arnaudb: gerrit: fix AlertLintProblem for GerritDiskSpaceExhaustionIncoming [alerts] - 10https://gerrit.wikimedia.org/r/1277565 (https://phabricator.wikimedia.org/T423601) [14:39:21] (03CR) 10Arnaudb: [C:03+2] gerrit: fix AlertLintProblem for 
GerritDiskSpaceExhaustionIncoming [alerts] - 10https://gerrit.wikimedia.org/r/1277565 (https://phabricator.wikimedia.org/T423601) (owner: 10Arnaudb) [14:39:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:40:53] (03Merged) 10jenkins-bot: gerrit: fix AlertLintProblem for GerritDiskSpaceExhaustionIncoming [alerts] - 10https://gerrit.wikimedia.org/r/1277565 (https://phabricator.wikimedia.org/T423601) (owner: 10Arnaudb) [14:42:03] (03Abandoned) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:42:21] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: puppet (systemd::service) attempts to start manually masked units - https://phabricator.wikimedia.org/T211027#11861477 (10LSobanski) 05Open→03Resolved a:03LSobanski This should no longer be an issue with a systemd define in place. Please reopen if... 
[14:42:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [14:42:45] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: 10Atsuko) [14:43:31] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Fix regex.yaml single-regex issue - https://phabricator.wikimedia.org/T183565#11861492 (10LSobanski) p:05Medium→03Low [14:43:47] (03PS1) 10DLynch: Enable mobile editor abandonment survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) [14:44:28] (03CR) 10Herron: [C:03+2] kafka-logging: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1276745 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [14:45:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P91620 and previous config saved to /var/cache/conftool/dbconfig/20260427-144531-fceratto.json [14:46:36] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Use multiple puppetdbs on puppet masters - https://phabricator.wikimedia.org/T169318#11861528 (10LSobanski) 05Open→03Resolved a:03LSobanski The original bug is no longer applicable a separate task will be created for reviewing the availability pla... 
[14:47:05] !log add gnmic 0.45 to bookworm-wikimedia - T416360 [14:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:09] T416360: gNMIc: investigate new "collector" command - https://phabricator.wikimedia.org/T416360 [14:47:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [14:47:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P91621 and previous config saved to /var/cache/conftool/dbconfig/20260427-144718-fceratto.json [14:47:22] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:47:27] jouncebot: now [14:47:27] For the next 0 hour(s) and 12 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1430) [14:47:31] jouncebot: next [14:47:32] In 0 hour(s) and 42 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1530) [14:48:14] is anyone deploying anything during this window? I mean, mediawiki wise? 
[14:49:13] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11861561 (10Ladsgroup) [14:49:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11861563 (10Jclark-ctr) 05Open→03Resolved [14:49:39] FIRING: [23x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:49:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P91622 and previous config saved to /var/cache/conftool/dbconfig/20260427-144949-fceratto.json [14:50:11] (03PS1) 10DLynch: Backwards compatibility for mw.testKitchen.compat (ironically) [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277572 (https://phabricator.wikimedia.org/T422740) [14:50:45] !log add gnmic 0.45 to trixie-wikimedia - T416360 [14:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:51] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [14:51:25] 10SRE-tools, 10bacula, 10Data-Persistence-Backup, 06Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11861581 (10jcrespo) 05Open→03Resolved Resolving unless issues are... 
[14:51:48] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab2003.codfw.wmnet with OS bullseye [14:55:02] !log aokoth@cumin1003 START - Cookbook sre.hosts.reimage for host phab2003.codfw.wmnet with OS bullseye [14:55:09] !log upgrade gnmic on netflow7002 - T416360 [14:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:13] T416360: gNMIc: investigate new "collector" command - https://phabricator.wikimedia.org/T416360 [14:55:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P91623 and previous config saved to /var/cache/conftool/dbconfig/20260427-145539-fceratto.json [14:55:57] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T421989#11861618 (10Jclark-ctr) 05Open→03Resolved [14:56:55] (03CR) 10Muehlenhoff: [C:03+2] mediawiki::php: Fix version of php-common if ICU72 is enabled [puppet] - 10https://gerrit.wikimedia.org/r/1277497 (https://phabricator.wikimedia.org/T422964) (owner: 10Muehlenhoff) [14:57:16] Hey folks, I will move forward with https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1000 now, since it had to be postponed earlier [14:58:07] !log jiji@deploy1003 Locking from deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie [14:58:40] (03CR) 10Effie Mouzeli: mw-mcrouter: bump image and new config (eqiad+codfw) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277300 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [14:58:44] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: bump image and new config (eqiad+codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277300 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [14:59:39] FIRING: [23x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - 
https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:59:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1091.eqiad.wmnet with OS bullseye [14:59:56] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11861636 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1091.eqiad.... [14:59:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P91624 and previous config saved to /var/cache/conftool/dbconfig/20260427-145957-fceratto.json [15:00:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1092.eqiad.wmnet with OS bullseye [15:00:34] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11861640 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1092.eq... 
[15:00:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be1092 [15:01:00] (03Merged) 10jenkins-bot: mw-mcrouter: bump image and new config (eqiad+codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277300 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [15:02:44] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [15:03:46] (03PS4) 10Elukey: profile::cache::purge: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277512 (https://phabricator.wikimedia.org/T420993) [15:03:46] (03PS4) 10Elukey: profile::hcaptcha: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) [15:03:46] (03PS4) 10Elukey: profile::dragonfly: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277510 (https://phabricator.wikimedia.org/T420993) [15:03:46] (03PS4) 10Elukey: profile::docker_registry: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) [15:03:47] (03PS4) 10Elukey: profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) [15:03:50] (03PS4) 10Elukey: profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) [15:03:54] (03PS4) 10Elukey: profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) [15:04:00] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11861661 (10MatthewVernon) [15:04:39] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - 
https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:05:20] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:05:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T419961)', diff saved to https://phabricator.wikimedia.org/P91625 and previous config saved to /var/cache/conftool/dbconfig/20260427-150547-fceratto.json [15:05:52] I'm just going to do a quick bugfix backport. [15:06:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1252.eqiad.wmnet with reason: Maintenance [15:06:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T419961)', diff saved to https://phabricator.wikimedia.org/P91626 and previous config saved to /var/cache/conftool/dbconfig/20260427-150616-fceratto.json [15:06:32] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1070.eqiad.wmnet with reason: vacuum overlarge container dbs [15:06:42] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11861670 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=65b38204-adf9-4abf-8c6d-03ef5e22cccb) set by ladsgroup@cum... [15:07:10] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:07:17] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1092 - mvernon@cumin2002" [15:07:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1092 - mvernon@cumin2002" [15:07:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:23] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be1092.eqiad.wmnet 32.32.64.10.in-addr.arpa 2.3.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:07:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be1092.eqiad.wmnet 32.32.64.10.in-addr.arpa 2.3.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:07:27] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1092 [15:07:29] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [15:07:30] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:07:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS trixie [15:07:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1092 [15:07:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be1092 [15:08:12] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [15:08:47] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [15:09:22] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:09:44] (03CR) 10Elukey: [C:03+2] profile::cache::purge: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277512 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:10:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P91627 and previous config saved to /var/cache/conftool/dbconfig/20260427-151005-fceratto.json [15:10:49] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:11:00] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [15:12:49] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [15:15:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T419961)', diff saved to https://phabricator.wikimedia.org/P91629 and previous config saved to /var/cache/conftool/dbconfig/20260427-151528-fceratto.json [15:18:11] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.7 - https://phabricator.wikimedia.org/T423723#11861717 (10herron) [15:19:01] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.7 - https://phabricator.wikimedia.org/T423723#11861722 (10herron) [15:19:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:19:35] (03CR) 10Andrew Bogott: [C:03+2] Add upstream repos for openstack flamingo and gazpacho [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [15:19:39] FIRING: [2x] CertAlmostExpired: Certificate for service wdqs1020:443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:20:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P91630 and previous config saved to /var/cache/conftool/dbconfig/20260427-152013-fceratto.json [15:20:19] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:20:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:20:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T419635)', diff saved to https://phabricator.wikimedia.org/P91631 and previous config saved to /var/cache/conftool/dbconfig/20260427-152038-fceratto.json [15:23:46] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:24:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:25:36] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:25:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P91632 and previous config saved to /var/cache/conftool/dbconfig/20260427-152536-fceratto.json [15:25:44] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:26:01] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage [15:26:30] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [15:26:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [15:26:53] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [15:27:01] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [15:27:36] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:28:20] (03PS5) 10Elukey: profile::dragonfly: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277510 (https://phabricator.wikimedia.org/T420993) [15:28:20] (03PS5) 10Elukey: profile::docker_registry: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) [15:28:20] (03PS5) 10Elukey: profile::hcaptcha: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) [15:28:21] (03PS5) 10Elukey: profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) [15:28:22] (03PS5) 10Elukey: profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) [15:28:23] (03PS5) 10Elukey: profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) [15:29:39] RESOLVED: [2x] CertAlmostExpired: Certificate for service wdqs1020:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wdqs1020:443 - 
https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:29:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage [15:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1530). [15:30:07] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:30:13] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:30:35] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:30:39] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:31:35] (03CR) 10Elukey: [C:03+2] profile::dragonfly: move to the new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277510 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:31:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage [15:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:35:20] !log jiji@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie (duration: 37m 12s) [15:35:32] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:35:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P91633 and previous config saved to /var/cache/conftool/dbconfig/20260427-153544-fceratto.json [15:35:54] (03PS1) 10Herron: kafka-logging: set eqiad (and all) brokers to confluent distro 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277581 (https://phabricator.wikimedia.org/T423723) [15:36:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage [15:36:23] (03CR) 10Elukey: [C:03+1] kafka-logging: set eqiad (and all) brokers to confluent distro 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277581 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [15:36:28] !log dancy@deploy1003 Installing scap version "4.252.0" for 2 host(s) [15:37:15] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:37:32] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:37:36] !log mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 5 echo-subscriptions-email-page-review (T406724) [15:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:40] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [15:38:03] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:38:21] !log dancy@deploy1003 Installation of scap version "4.252.0" completed for 2 hosts [15:38:38] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:39:39] FIRING: CertAlmostExpired: Certificate for service wdqs1016:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:40:28] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:41:18] (CR) Elukey: [C:+2] profile::docker_registry: move to the new pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1277511 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [15:42:18] kemayo: You should be good to deploy now [15:42:31] dancy: thanks! [15:42:35] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab2003.codfw.wmnet with OS bullseye [15:42:43] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1277572 (https://phabricator.wikimedia.org/T422740) (owner: DLynch) [15:43:23] !log aokoth@cumin1003 START - Cookbook sre.hosts.reimage for host phab2003.codfw.wmnet with OS trixie [15:44:16] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:44:28] (Merged) jenkins-bot: Backwards compatibility for mw.testKitchen.compat (ironically) [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1277572 (https://phabricator.wikimedia.org/T422740) (owner: DLynch) [15:44:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:44:45] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1277572|Backwards compatibility for mw.testKitchen.compat (ironically) (T422740)]] [15:44:49] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:44:49] T422740: Implement Suggestion Mode instrumentation spec in Test Kitchen - https://phabricator.wikimedia.org/T422740 [15:45:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T419961)', diff saved to https://phabricator.wikimedia.org/P91634 and previous config saved to /var/cache/conftool/dbconfig/20260427-154552-fceratto.json [15:46:08] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:46:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1260.eqiad.wmnet with reason: Maintenance [15:46:18] !log ladsgroup@cumin1003 START - Cookbook sre.hosts.remove-downtime for ms-be1070.eqiad.wmnet [15:46:19] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1070.eqiad.wmnet [15:46:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T419961)', diff saved to https://phabricator.wikimedia.org/P91635 and previous config saved to /var/cache/conftool/dbconfig/20260427-154620-fceratto.json [15:46:24] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1277572|Backwards compatibility for mw.testKitchen.compat (ironically) (T422740)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:47:06] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11861872 (MoritzMuehlenhoff) [15:47:33] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:48:03] (PS6) Elukey: profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) [15:48:03] (PS6) Elukey: profile::hcaptcha: move to the discovery2026 pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) [15:48:03] (PS6) Elukey: profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) [15:48:04] (PS6) Elukey: profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) [15:49:25] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:49:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:50:57] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2171: Repooling [15:51:40] (CR) Elukey: [C:+2] profile::etcd::tlsproxy: move to the new pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1277513 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [15:52:52] (CR) Brouberol: [C:+1] kafka-logging: set eqiad (and all) brokers to confluent distro 77 [puppet] - https://gerrit.wikimedia.org/r/1277581 (https://phabricator.wikimedia.org/T423723) (owner: Herron) [15:53:04] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1210: Repooling [15:54:16] (CR) Herron: [V:+1] "Thanks for the quick reviews! I'll plan to go ahead with this tomorrow (4/28)" [puppet] - https://gerrit.wikimedia.org/r/1277581 (https://phabricator.wikimedia.org/T423723) (owner: Herron) [15:54:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:54:49] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:55:14] (CR) Bking: [C:+2] cloudelastic: set role-level hiera for OpenSearch 2/Trixie [puppet] - https://gerrit.wikimedia.org/r/1276818 (https://phabricator.wikimedia.org/T422860) (owner: Bking) [15:55:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1092.eqiad.wmnet with OS bullseye [15:56:07] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11861921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host
ms-be1092.eqiad.... [15:56:50] !log kemayo@deploy1003 kemayo: Continuing with deployment [15:56:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T419961)', diff saved to https://phabricator.wikimedia.org/P91638 and previous config saved to /var/cache/conftool/dbconfig/20260427-155655-fceratto.json [15:57:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1093.eqiad.wmnet with OS bullseye [15:57:13] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11861927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1093.eq... [15:57:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be1093 [15:57:41] !log aokoth@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on phab2003.codfw.wmnet with reason: host reimage [15:57:47] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [15:59:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:59:39] FIRING: [2x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:00:42] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277572|Backwards compatibility for mw.testKitchen.compat (ironically) (T422740)]] (duration: 15m 57s) [16:00:48] T422740: Implement Suggestion Mode instrumentation spec in Test Kitchen - https://phabricator.wikimedia.org/T422740 [16:01:49] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:01:55] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [16:01:55] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1093 - mvernon@cumin2002" [16:02:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1093 - mvernon@cumin2002" [16:02:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:02:02] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be1093.eqiad.wmnet 133.48.64.10.in-addr.arpa 3.3.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:02:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be1093.eqiad.wmnet 133.48.64.10.in-addr.arpa 3.3.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:02:06] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1093 [16:05:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1093 [16:05:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be1093 [16:06:10] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2171: Repooling [16:06:23] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab2003.codfw.wmnet with reason: host reimage [16:07:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P91640 and previous 
config saved to /var/cache/conftool/dbconfig/20260427-160704-fceratto.json [16:08:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1210: Repooling [16:09:19] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [16:10:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91642 and previous config saved to /var/cache/conftool/dbconfig/20260427-161023-fceratto.json [16:10:29] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:12:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91643 and previous config saved to /var/cache/conftool/dbconfig/20260427-161233-fceratto.json [16:14:39] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:14:48] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11862034 (MatthewVernon) [16:15:44] (PS1) Gerrit maintenance bot: mariadb: Promote db1210 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1277598 (https://phabricator.wikimedia.org/T424550) [16:15:55] (PS1) Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1277599
(https://phabricator.wikimedia.org/T424550) [16:15:57] (PS1) Gerrit maintenance bot: mariadb: Promote db1210 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1277600 (https://phabricator.wikimedia.org/T424551) [16:16:03] (PS1) Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1277601 (https://phabricator.wikimedia.org/T424551) [16:16:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P91644 and previous config saved to /var/cache/conftool/dbconfig/20260427-161710-fceratto.json [16:21:50] (PS1) AKhatun: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1277605 (https://phabricator.wikimedia.org/T424223) [16:22:43] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab2003.codfw.wmnet with OS trixie [16:22:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91646 and previous config saved to /var/cache/conftool/dbconfig/20260427-162243-fceratto.json [16:24:34] FIRING: [220x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:24:39] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:24:56] (CR) BryanDavis: "Could this change be the cause of T424549 in the Beta
Cluster?" [puppet] - https://gerrit.wikimedia.org/r/1277512 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [16:25:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... [16:25:51] Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F3%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:26:09] !ack [16:26:09] 7876 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [16:26:14] yay [16:27:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T419961)', diff saved to https://phabricator.wikimedia.org/P91647 and previous config saved to /var/cache/conftool/dbconfig/20260427-162718-fceratto.json [16:27:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage [16:27:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1261.eqiad.wmnet with reason: Maintenance [16:27:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T419961)', diff saved to https://phabricator.wikimedia.org/P91648 and previous config saved to /var/cache/conftool/dbconfig/20260427-162746-fceratto.json [16:28:22] (PS3) Atsuko: deployment_server: move charlie/admin_ng to debian package [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) [16:28:58] (CR) CI reject: [V:-1] deployment_server: move charlie/admin_ng to debian package [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko) [16:29:34] FIRING: [220x]
CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:32:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage [16:32:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91649 and previous config saved to /var/cache/conftool/dbconfig/20260427-163251-fceratto.json [16:33:16] (PS4) Atsuko: deployment_server: move charlie/admin_ng to debian package [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) [16:33:18] (PS1) Bking: cloudelastic: update allowed plugins list [puppet] - https://gerrit.wikimedia.org/r/1277614 (https://phabricator.wikimedia.org/T422860) [16:33:29] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1277614 (https://phabricator.wikimedia.org/T422860) (owner: Bking) [16:33:50] (CR) CI reject: [V:-1] deployment_server: move charlie/admin_ng to debian package [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko) [16:34:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ...
[16:35:51] Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F3%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:36:19] (PS5) Atsuko: deployment_server: move charlie/admin_ng to debian package [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) [16:37:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T419961)', diff saved to https://phabricator.wikimedia.org/P91650 and previous config saved to /var/cache/conftool/dbconfig/20260427-163725-fceratto.json [16:40:10] (CR) Atsuko: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko) [16:40:15] (CR) Bking: [C:+2] cloudelastic: update allowed plugins list [puppet] - https://gerrit.wikimedia.org/r/1277614 (https://phabricator.wikimedia.org/T422860) (owner: Bking) [16:41:00] (PS1) Dzahn: doc: lower CDN caching from 1 hour to 10 minutes [puppet] - https://gerrit.wikimedia.org/r/1277621 (https://phabricator.wikimedia.org/T423951) [16:43:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91651 and previous config saved to /var/cache/conftool/dbconfig/20260427-164259-fceratto.json [16:43:05] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:43:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:43:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:43:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P91652 and previous config saved to /var/cache/conftool/dbconfig/20260427-164323-fceratto.json [16:43:32] (PS1) JMeybohm: admin_ng: Move all clusters to the pki discovery2026 intermediate [deployment-charts] - https://gerrit.wikimedia.org/r/1277622 (https://phabricator.wikimedia.org/T420993) [16:45:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P91653 and previous config saved to /var/cache/conftool/dbconfig/20260427-164533-fceratto.json [16:46:05] (CR) Elukey: [C:+1] "<3" [deployment-charts] - https://gerrit.wikimedia.org/r/1277622 (https://phabricator.wikimedia.org/T420993) (owner: JMeybohm) [16:47:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P91654 and previous config saved to /var/cache/conftool/dbconfig/20260427-164734-fceratto.json [16:49:58] (CR) Bking: [C:+1] profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [16:51:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1093.eqiad.wmnet with OS bullseye [16:52:05] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11862262 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1093.eqiad....
[16:53:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.eqiad.wmnet with OS trixie [16:55:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P91656 and previous config saved to /var/cache/conftool/dbconfig/20260427-165541-fceratto.json [16:57:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P91657 and previous config saved to /var/cache/conftool/dbconfig/20260427-165742-fceratto.json [16:58:48] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1700) [17:00:05] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T1700). [17:00:13] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1012. [17:00:23] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1012. 
[17:04:34] FIRING: [220x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:05:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P91658 and previous config saved to /var/cache/conftool/dbconfig/20260427-170550-fceratto.json [17:07:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T419961)', diff saved to https://phabricator.wikimedia.org/P91659 and previous config saved to /var/cache/conftool/dbconfig/20260427-170750-fceratto.json [17:08:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1262.eqiad.wmnet with reason: Maintenance [17:08:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T419961)', diff saved to https://phabricator.wikimedia.org/P91660 and previous config saved to /var/cache/conftool/dbconfig/20260427-170819-fceratto.json [17:09:34] FIRING: [219x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:10:47] nice that alerts are summarized nowadays [17:11:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [17:13:26] (CR) Santiago Faci: [C:-1] EventStreamConfig: remove ABST contextual attribute (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: Bearloga) [17:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:34] FIRING: [220x]
CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:16:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P91661 and previous config saved to /var/cache/conftool/dbconfig/20260427-171600-fceratto.json [17:16:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:16:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [17:16:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P91662 and previous config saved to /var/cache/conftool/dbconfig/20260427-171615-fceratto.json [17:16:31] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [17:17:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T419961)', diff saved to https://phabricator.wikimedia.org/P91663 and previous config saved to /var/cache/conftool/dbconfig/20260427-171659-fceratto.json [17:18:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:18:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P91664 and previous config saved to /var/cache/conftool/dbconfig/20260427-171826-fceratto.json [17:19:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:24:48] (PS1) Bking: cloudelastic: Another required plugins update patch [puppet] - https://gerrit.wikimedia.org/r/1277640 (https://phabricator.wikimedia.org/T422860) [17:25:03] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1277640 (https://phabricator.wikimedia.org/T422860) (owner: Bking) [17:25:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: Codename Noreste) [17:26:01] (PS4) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [17:26:31] (CR) CI reject: [V:-1] beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: BryanDavis) [17:27:03] (PS2) Bking: cloudelastic: Another required plugins update patch [puppet] - https://gerrit.wikimedia.org/r/1277640 (https://phabricator.wikimedia.org/T422860) [17:27:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P91665 and previous config saved to /var/cache/conftool/dbconfig/20260427-172707-fceratto.json [17:27:17] (CR) Bking: [C:+2] cloudelastic: Another required plugins update patch [puppet] - https://gerrit.wikimedia.org/r/1277640 (https://phabricator.wikimedia.org/T422860) (owner: Bking) [17:28:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P91666 and previous config saved to /var/cache/conftool/dbconfig/20260427-172835-fceratto.json [17:34:34] FIRING: [222x]
CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:36:50] (CR) Ssingh: [C:+1] "Sounds like an acceptable compromise." [puppet] - https://gerrit.wikimedia.org/r/1277621 (https://phabricator.wikimedia.org/T423951) (owner: Dzahn) [17:37:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P91667 and previous config saved to /var/cache/conftool/dbconfig/20260427-173715-fceratto.json [17:38:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P91668 and previous config saved to /var/cache/conftool/dbconfig/20260427-173843-fceratto.json [17:39:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:40:13] (PS6) Atsuko: deployment_server: move charlie/admin_ng to debian package [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) [17:43:56] (PS1) Santiago Faci: JS SDK: Add aliases for compatibility with existing experiment code [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1277656 (https://phabricator.wikimedia.org/T419513) [17:44:19] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1277656 (https://phabricator.wikimedia.org/T419513) (owner: Santiago Faci) [17:46:17] (CR) Atsuko: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko) [17:47:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262
(T419961)', diff saved to https://phabricator.wikimedia.org/P91669 and previous config saved to /var/cache/conftool/dbconfig/20260427-174723-fceratto.json [17:47:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1263.eqiad.wmnet with reason: Maintenance [17:47:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T419961)', diff saved to https://phabricator.wikimedia.org/P91670 and previous config saved to /var/cache/conftool/dbconfig/20260427-174751-fceratto.json [17:48:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P91671 and previous config saved to /var/cache/conftool/dbconfig/20260427-174852-fceratto.json [17:48:57] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:48:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [17:49:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P91672 and previous config saved to /var/cache/conftool/dbconfig/20260427-174906-fceratto.json [17:51:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P91673 and previous config saved to /var/cache/conftool/dbconfig/20260427-175116-fceratto.json [17:54:39] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:55:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 257784136 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[17:55:52] 07Puppet, 06SRE, 03Readers Essential Work (Simplify MobileFrontend): Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#11862498 (10Jdlrobson-WMF) 05Stalled→03Declined Declining for now. [17:56:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3493344 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:56:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T419961)', diff saved to https://phabricator.wikimedia.org/P91674 and previous config saved to /var/cache/conftool/dbconfig/20260427-175626-fceratto.json [17:59:39] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:01:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P91675 and previous config saved to /var/cache/conftool/dbconfig/20260427-180124-fceratto.json [18:02:27] (03PS1) 10Stoyofuku-wmf: Enable the reading list beta feature survey on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277667 (https://phabricator.wikimedia.org/T421776) [18:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 5d 19h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [18:06:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P91676 and previous config saved to /var/cache/conftool/dbconfig/20260427-180635-fceratto.json [18:09:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to 
expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:11:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P91677 and previous config saved to /var/cache/conftool/dbconfig/20260427-181132-fceratto.json [18:14:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:15:32] (03CR) 10Ssingh: [C:03+1] conftool: add sophroid etcd data [puppet] - 10https://gerrit.wikimedia.org/r/1248611 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:16:41] (03CR) 10RLazarus: [C:03+1] "+1 after the nits below, no need for another review :)" [puppet] - 10https://gerrit.wikimedia.org/r/1277148 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:16:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P91678 and previous config saved to /var/cache/conftool/dbconfig/20260427-181643-fceratto.json [18:18:13] (03CR) 10Ssingh: [C:03+1] wmnet: add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:19:06] (03PS1) 10RLazarus: admin: Add rzl backup Yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1277676 [18:19:38] (03CR) 10Jasmine: [C:03+2] conftool: add sophroid etcd data [puppet] - 10https://gerrit.wikimedia.org/r/1248611 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:21:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P91679 and previous config saved to /var/cache/conftool/dbconfig/20260427-182140-fceratto.json [18:21:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:21:46] !log 
fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [18:21:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P91680 and previous config saved to /var/cache/conftool/dbconfig/20260427-182154-fceratto.json [18:23:06] (03CR) 10Jasmine: [C:03+2] wmnet: add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:24:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P91681 and previous config saved to /var/cache/conftool/dbconfig/20260427-182404-fceratto.json [18:24:10] (03CR) 10Andrew Bogott: "It took me a while to figure out what this was used for, eventually I found this sentence on phab T320815: "We have been using it to authe" [puppet] - 10https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [18:24:42] !log jasmine@dns1004 START - running authdns-update [18:25:20] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277605 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [18:26:21] !log jasmine@dns1004 END - running authdns-update [18:26:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T419961)', diff saved to https://phabricator.wikimedia.org/P91682 and previous config saved to /var/cache/conftool/dbconfig/20260427-182652-fceratto.json [18:29:39] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:29:40] (03PS2) 10Jasmine: service::catalog: add sophroid service catalog entry [puppet] - 
10https://gerrit.wikimedia.org/r/1277148 (https://phabricator.wikimedia.org/T418748) [18:30:35] (03CR) 10Jasmine: service::catalog: add sophroid service catalog entry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1277148 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:30:58] (03CR) 10Jasmine: [C:03+2] service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1277148 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:32:16] (03CR) 10Jasmine: [C:03+2] role::aux_k8s::worker: add sophroid to lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:34:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P91685 and previous config saved to /var/cache/conftool/dbconfig/20260427-183413-fceratto.json [18:34:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:34:39] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:37:16] (03PS1) 10Jasmine: service::catalog: sophroid: set state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1277684 (https://phabricator.wikimedia.org/T418748) [18:37:16] jasmine_: so we may need to wipe the caches for sophroid.svc.$site.wmnet [18:38:13] that's actually my bad, I was checking if the record was active and it cached the NXDOMAIN [18:38:26] so that's sudo cumin 'A:dnsbox' 'rec_control wipe-cache sophroid.svc.codfw.wmnet$' [18:38:38] sukhe@cumin1003:~$ dig sophroid.svc.codfw.wmnet +short [18:38:38] 10.2.1.41 [18:38:38] sukhe@cumin1003:~$ dig sophroid.svc.eqiad.wmnet +short [18:38:38] 10.2.2.41 [18:38:39] now it's fine 
[18:38:55] there is also a cookbook for that ;) [18:39:11] volans: oh right, I am so used to using A:dnsbox that I forgot. indeed yes. [18:39:22] jasmine_: cookbook sre.dns.wipe-cache [18:39:27] :-P [18:39:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:40:40] sukhe: no worries, & thank you! [18:41:15] (03PS1) 10Bking: cloudelastic: Another ugly plugins patch [puppet] - 10https://gerrit.wikimedia.org/r/1277687 (https://phabricator.wikimedia.org/T422860) [18:41:46] (03CR) 10CI reject: [V:04-1] cloudelastic: Another ugly plugins patch [puppet] - 10https://gerrit.wikimedia.org/r/1277687 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:42:34] (03CR) 10Ssingh: [C:03+1] service::catalog: sophroid: set state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1277684 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:43:19] (03PS2) 10Bking: cloudelastic: Another ugly plugins patch [puppet] - 10https://gerrit.wikimedia.org/r/1277687 (https://phabricator.wikimedia.org/T422860) [18:44:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P91686 and previous config saved to /var/cache/conftool/dbconfig/20260427-184421-fceratto.json [18:44:24] (03CR) 10Bking: [C:03+2] cloudelastic: Another ugly plugins patch [puppet] - 10https://gerrit.wikimedia.org/r/1277687 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:48:09] (03CR) 10Jasmine: [C:03+2] service::catalog: sophroid: set state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1277684 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [18:48:42] (03PS1) 10Bking: cloudelastic: Add back the opensearch-ltr plugin [puppet] - 10https://gerrit.wikimedia.org/r/1277692 (https://phabricator.wikimedia.org/T422860) [18:49:03] (03PS3) 10Ecarg: Wikifunctions: add helm 
values for function-evaluator in Rust [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) [18:49:59] (03CR) 10Bking: [C:03+2] cloudelastic: Add back the opensearch-ltr plugin [puppet] - 10https://gerrit.wikimedia.org/r/1277692 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:51:04] (03PS1) 10Alex.sanford: Add 2FA enforcement demotion config for phase 1 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277693 [18:51:29] (03CR) 10Catrope: [C:03+1] Add 2FA enforcement demotion config for phase 1 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277693 (owner: 10Alex.sanford) [18:54:12] Catrope and I are doing a config deployment now [18:54:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P91687 and previous config saved to /var/cache/conftool/dbconfig/20260427-185429-fceratto.json [18:54:34] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:54:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [18:54:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1210 (T419635)', diff saved to https://phabricator.wikimedia.org/P91688 and previous config saved to /var/cache/conftool/dbconfig/20260427-185444-fceratto.json [18:54:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277693 (owner: 10Alex.sanford) [18:55:08] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [18:55:08] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - SSL 
Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [18:55:39] (03Merged) 10jenkins-bot: Add 2FA enforcement demotion config for phase 1 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277693 (owner: 10Alex.sanford) [18:56:02] !log alexsanford@deploy1003 Started scap sync-world: Backport for [[gerrit:1277693|Add 2FA enforcement demotion config for phase 1 groups]] [18:57:42] !log alexsanford@deploy1003 alexsanford: Backport for [[gerrit:1277693|Add 2FA enforcement demotion config for phase 1 groups]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:57:44] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd38bb95550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [18:57:44] dia.org/wiki/Search%23Administration [18:57:44] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f8eebff9550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [18:57:44] dia.org/wiki/Search%23Administration [18:57:44] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: 
HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f178fa59550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [18:57:44] dia.org/wiki/Search%23Administration [18:59:25] FIRING: [3x] SystemdUnitFailed: opensearch_2@cloudelastic-chi-eqiad.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:59:40] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:59:44] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1624, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 8, delayed_unassigned [18:59:44] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.50980392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:00:00] !log "Restarting pybal on the backup LVS servers in eqiad" [19:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:27] !log alexsanford@deploy1003 alexsanford: Continuing with deployment [19:00:44] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:00:47] (03CR) 10RLazarus: "For OOB verification I can just 
self-puppet-merge (which proves I'm the same rzl who can already ssh to the puppetserver) but happy to do " [puppet] - 10https://gerrit.wikimedia.org/r/1277676 (owner: 10RLazarus) [19:00:48] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:00:50] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:01:44] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1645, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 6, delayed_unassi [19:01:44] rds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.6365838885524 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:01:44] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1474, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 59, delayed_unassigne [19:01:44] : 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 96.15133724722766 
https://wikitech.wikimedia.org/wiki/Search%23Administration [19:02:32] (03PS2) 10Dzahn: doc: lower CDN caching from 1 hour to 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1277621 (https://phabricator.wikimedia.org/T423951) [19:03:02] (03PS1) 10Ebernhardson: cirrus: AB test query suggester variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277701 (https://phabricator.wikimedia.org/T407432) [19:03:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - sophroid_4252: Servers aux-k8s-worker1005.eqiad.wmnet, aux-k8s-worker1007.eqiad.wmnet, aux-k8s-worker1008.eqiad.wmnet, aux-k8s-worker1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:03:32] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 80 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [19:03:49] (03CR) 10CI reject: [V:04-1] cirrus: AB test query suggester variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277701 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [19:04:25] FIRING: [6x] SystemdUnitFailed: opensearch_2@cloudelastic-chi-eqiad.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:05] !log alexsanford@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277693|Add 2FA enforcement demotion config for phase 1 groups]] (duration: 09m 03s) [19:05:11] (03CR) 10Dzahn: [C:03+2] doc: lower CDN caching from 1 hour to 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1277621 (https://phabricator.wikimedia.org/T423951) (owner: 10Dzahn) [19:06:30] 06SRE, 10Continuous-Integration-Infrastructure, 06Traffic, 13Patch-For-Review: Lower varnish caching length on doc.wikimedia.org - 
https://phabricator.wikimedia.org/T184255#11862758 (10Dzahn) Further lowered caching from 1 hour to 10 minutes as a reaction to T423951. [19:06:48] (03PS1) 10Daniel Kinzler: rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 [19:06:54] !log "Restarting pybal on primary LVS servers in eqiad" [19:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:34] (03PS2) 10Daniel Kinzler: rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 [19:07:40] (03CR) 10Dzahn: [C:03+2] trafficserver: create a map for zuul.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1178084 (owner: 10Dzahn) [19:08:32] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 81 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [19:08:48] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - sophroid_4252: Servers aux-k8s-worker1005.eqiad.wmnet, aux-k8s-worker1007.eqiad.wmnet, aux-k8s-worker1008.eqiad.wmnet, aux-k8s-worker1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:08:57] oh that's interesting [19:09:31] jasmine_: [19:09:32] sukhe@lvs1020:~$ curl localhost:9090/pools/sophroid_4252 [19:09:32] aux-k8s-worker1004.eqiad.wmnet: enabled/down/not pooled [19:09:33] aux-k8s-worker1006.eqiad.wmnet: enabled/down/not pooled [19:09:33] aux-k8s-worker1009.eqiad.wmnet: enabled/down/not pooled [19:09:35] aux-k8s-worker1002.eqiad.wmnet: enabled/down/not pooled [19:09:37] aux-k8s-worker1007.eqiad.wmnet: enabled/down/pooled [19:09:40] aux-k8s-worker1003.eqiad.wmnet: enabled/down/pooled [19:09:40] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:09:42] aux-k8s-worker1005.eqiad.wmnet: enabled/down/pooled [19:09:45] 
aux-k8s-worker1008.eqiad.wmnet: enabled/down/pooled [19:09:47] I should not spam this channel probably [19:11:44] (03CR) 10Dzahn: [C:03+2] zuul: Use master branch of integration/config [puppet] - 10https://gerrit.wikimedia.org/r/1277198 (owner: 10Dduvall) [19:14:12] (03CR) 10Dzahn: "what about the port 9000? this patch seems to be dropping that part" [puppet] - 10https://gerrit.wikimedia.org/r/1277195 (owner: 10Dduvall) [19:14:25] RESOLVED: [6x] SystemdUnitFailed: opensearch_2@cloudelastic-chi-eqiad.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:26] (03PS1) 10Bking: cloudelastic: remove a plugin path that doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/1277704 (https://phabricator.wikimedia.org/T422860) [19:15:23] (03CR) 10Bking: [C:03+2] cloudelastic: remove a plugin path that doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/1277704 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:15:44] (03CR) 10Dzahn: "nevermind, I see it now in the zuul config file" [puppet] - 10https://gerrit.wikimedia.org/r/1277195 (owner: 10Dduvall) [19:18:02] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1277195/8475/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1277195 (owner: 10Dduvall) [19:19:43] !log dancy@deploy1003 Installing scap version "4.253.0" for 2 host(s) [19:20:25] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:20:44] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:20:48] RECOVERY - Check unit 
status of push_cross_cluster_settings_9200 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:20:50] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:21:34] !log dancy@deploy1003 Installation of scap version "4.253.0" completed for 2 hosts [19:22:27] Krinkle: I fixed the issue with the weird prompt output from scap. [19:22:40] scap backport, that is. [19:23:44] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1797141550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [19:23:44] dia.org/wiki/Search%23Administration [19:23:44] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fce320cd550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [19:23:44] dia.org/wiki/Search%23Administration [19:23:44] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 
0x7f117cb79550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [19:23:44] dia.org/wiki/Search%23Administration [19:24:06] (03PS1) 10Bking: cloudelastic: get rid of merge overrides [puppet] - 10https://gerrit.wikimedia.org/r/1277708 (https://phabricator.wikimedia.org/T422860) [19:24:21] ^^ expected, should be fixed shortly [19:24:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277708 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:24:36] (03PS1) 10Daniel Kinzler: rewst-gateway: switch to Apr2026 reate limit policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) [19:24:47] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:24:51] (03CR) 10Ssingh: "Sorry for the late review. Let's plan on when to merge this, since we may have to be a bit careful about this one. (I am not sure what exa" [puppet] - 10https://gerrit.wikimedia.org/r/1275750 (https://phabricator.wikimedia.org/T415454) (owner: 10Slyngshede) [19:25:49] (03CR) 10Ssingh: "By that I mean, fallback to regular TCP should kick in but yeah, we should be careful." 
[puppet] - 10https://gerrit.wikimedia.org/r/1275750 (https://phabricator.wikimedia.org/T415454) (owner: 10Slyngshede) [19:30:17] (03PS1) 10Bking: wdqs: enable new discovery intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1277713 (https://phabricator.wikimedia.org/T420993) [19:30:29] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277713 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [19:30:55] FIRING: [6x] SystemdUnitFailed: opensearch_2@cloudelastic-chi-eqiad.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:31:44] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:31:48] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:31:50] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:31:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:32:03] !log root@apt1002:~# reprepro --noskipold --restrict vopsbot update bookworm-wikimedia [19:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:07] (03CR) 10Bking: [C:03+2] cloudelastic: get rid of merge overrides [puppet] - 
10https://gerrit.wikimedia.org/r/1277708 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:32:17] (03CR) 10Dzahn: [C:03+2] "@Dduvall I just ran a cache purge for this new URL - seeing an empty page now - but not the default Wikimedia error page anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1178084 (owner: 10Dzahn) [19:32:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11862862 (10VRiley-WMF) [19:33:14] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:33:33] (03CR) 10Dzahn: [C:03+2] "check the source - there is actual HTML served - it does contain things like " d=this["webpackJsonp@zuul-ci/dashboard"]" so this is zuul" [puppet] - 10https://gerrit.wikimedia.org/r/1178084 (owner: 10Dzahn) [19:34:04] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:34:39] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:35:44] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1474, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 59, delayed_unassigne 
[19:35:44] : 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 96.15133724722766 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:35:55] FIRING: [6x] SystemdUnitFailed: opensearch_2@cloudelastic-chi-eqiad.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:44] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1645, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 6, delayed_unassi [19:36:44] rds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.6365838885524 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:36:44] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1624, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 8, delayed_unassigned [19:36:44] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.50980392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:40:07] 06SRE, 06DBA, 07Wikimedia-Incident: External store unreachable: "Database servers in clusterXX are overloaded" 
- https://phabricator.wikimedia.org/T422130#11862876 (10Krinkle) [19:40:55] RESOLVED: [6x] SystemdUnitFailed: opensearch_2@cloudelastic-chi-eqiad.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:41:35] !log "Restarting pybal on the backup LVS servers in codfw" [19:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:44] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:41:48] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:41:50] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:43:00] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:43:40] (03PS5) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [19:44:10] (03CR) 10CI reject: [V:04-1] beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: 10BryanDavis) [19:44:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:45:06] !log vriley@cumin1003 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host ganeti1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:45:42] PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 97 connections established with conf2004.codfw.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal [19:45:42] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 79 connections established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [19:45:49] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1057.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:45:50] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:48:00] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:48:56] vriley@cumin1003 provision (PID 1632896) is awaiting input [19:49:32] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - sophroid_4252: Servers aux-k8s-worker2003.codfw.wmnet, aux-k8s-worker2005.codfw.wmnet, aux-k8s-worker2002.codfw.wmnet, aux-k8s-worker2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:49:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:49:43] !log "Restarting pybal on primary LVS servers in codfw" [19:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:42] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 98 connections established with conf2004.codfw.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal [19:50:42] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 80 connections 
established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [19:50:50] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:52:14] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - sophroid_4252: Servers aux-k8s-worker2003.codfw.wmnet, aux-k8s-worker2005.codfw.wmnet, aux-k8s-worker2002.codfw.wmnet, aux-k8s-worker2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:54:34] FIRING: [217x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:55:00] (03CR) 10ArielGlenn: rewst-gateway: switch to Apr2026 reate limit policy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [19:59:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T2000). [20:00:05] sfaci: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:02:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1057.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:02:31] (03PS6) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [20:02:39] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:03:16] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:03:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11862953 (10VRiley-WMF) [20:04:04] (03PS1) 10Jasmine: service::catalog: sophroid: set state to production [puppet] - 10https://gerrit.wikimedia.org/r/1277726 (https://phabricator.wikimedia.org/T418748) [20:04:05] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:04:39] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:05:08] o/ Sorry! I'm a bit late. Could someone deploy the patch I have scheduled? [20:08:04] sfaci: Sure thing [20:08:30] dancy: Thank you very much! 
[20:08:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277656 (https://phabricator.wikimedia.org/T419513) (owner: 10Santiago Faci) [20:10:00] (03Merged) 10jenkins-bot: JS SDK: Add aliases for compatibility with existing experiment code [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1277656 (https://phabricator.wikimedia.org/T419513) (owner: 10Santiago Faci) [20:10:15] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1277656|JS SDK: Add aliases for compatibility with existing experiment code (T419513)]] [20:10:20] T419513: JS SDK: Read everyone experiment enrollment from the WMF-Uniq server timing header - https://phabricator.wikimedia.org/T419513 [20:10:33] (03PS2) 10Bartosz Dziewoński: rest-gateway: switch to Apr2026 rate limit policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [20:10:46] (03PS3) 10Bartosz Dziewoński: rest-gateway: switch to Apr2026 rate limit policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [20:10:56] (03CR) 10Bartosz Dziewoński: [C:03+1] rest-gateway: switch to Apr2026 rate limit policy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [20:11:51] !log dancy@deploy1003 dancy, sfaci: Backport for [[gerrit:1277656|JS SDK: Add aliases for compatibility with existing experiment code (T419513)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:12:55] dancy: Tested! Working fine! [20:13:01] ok! 
[20:13:02] !log dancy@deploy1003 dancy, sfaci: Continuing with deployment [20:14:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:15:16] (03CR) 10Ssingh: [C:03+1] service::catalog: sophroid: set state to production [puppet] - 10https://gerrit.wikimedia.org/r/1277726 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [20:15:44] (03CR) 10Jasmine: [C:03+2] service::catalog: sophroid: set state to production [puppet] - 10https://gerrit.wikimedia.org/r/1277726 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [20:15:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863011 (10VRiley-WMF) [20:16:22] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING WARNING - Packet loss = 77%, RTA = 71.04 ms [20:16:40] (03PS2) 10Ebernhardson: cirrus: AB test query suggester variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277701 (https://phabricator.wikimedia.org/T407432) [20:16:54] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277656|JS SDK: Add aliases for compatibility with existing experiment code (T419513)]] (duration: 06m 38s) [20:16:58] T419513: JS SDK: Read everyone experiment enrollment from the WMF-Uniq server timing header - https://phabricator.wikimedia.org/T419513 [20:17:37] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti1055.eqiad.wmnet with OS bookworm [20:17:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863019 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host ganeti1055.eqiad.wmnet with OS bookworm [20:18:05] (03CR) 10Bking: [C:03+2] wdqs: enable new discovery intermediate certificate [puppet] 
- 10https://gerrit.wikimedia.org/r/1277713 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [20:19:33] dancy: Thank you very much!!! [20:19:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:24:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:28:38] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: switch to Apr2026 rate limit policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [20:30:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1210: Repooling [20:36:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1277676 (owner: 10RLazarus) [20:36:24] (03CR) 10RLazarus: [C:03+2] admin: Add rzl backup Yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1277676 (owner: 10RLazarus) [20:44:39] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:45:23] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1210: Repooling [20:47:04] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1055.eqiad.wmnet with reason: host reimage [20:48:02] (03CR) 10Andrew Bogott: [C:03+2] Remove openstack::[client|server]packages::flamingo::bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1276010 (owner: 10Andrew Bogott) [20:49:19] (03PS1) 10Jasmine: sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) [20:49:39] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1012:443 is 
about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:50:01] (03CR) 10CI reject: [V:04-1] sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [20:51:24] Dear CDN, now is a really bad time to keep blocking CI's access to gerrit. [20:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:25] (03PS2) 10Jasmine: sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) [20:54:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1055.eqiad.wmnet with reason: host reimage [20:54:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:55:23] (03CR) 10CI reject: [V:04-1] sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [20:55:40] (03PS1) 10Andrew Bogott: Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1277747 (https://phabricator.wikimedia.org/T422646) [20:57:33] (03PS2) 10Andrew Bogott: Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1277747 (https://phabricator.wikimedia.org/T422646) [20:58:58] (03PS3) 10Jasmine: sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) [20:59:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:59:42] (03CR) 10CI reject: [V:04-1] sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [21:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T2100). [21:00:05] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 232 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 760, active_shards: 1301, relocating_shards: 2, initializing_shards: 78, unassigned_shards: 154, delayed_unassigned_shards: 0 [21:00:05] _of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2640, active_shards_percent_as_number: 84.86627527723418 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:05] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 232 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 760, active_shards: 1301, relocating_shards: 2, initializing_shards: 78, unassigned_shards: 154, delayed_unassigned_shards: 0 [21:00:05] _of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2667, active_shards_percent_as_number: 84.86627527723418 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:05] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 232 threshold =0.15 breach: cluster_name: 
cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 760, active_shards: 1301, relocating_shards: 2, initializing_shards: 78, unassigned_shards: 154, delayed_unassigned_shards: 0 [21:00:05] _of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2706, active_shards_percent_as_number: 84.86627527723418 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:07] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 232 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1301, relocating_shards: 2, initializing_shards: 78, unassigned_shards: [21:00:07] layed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4705, active_shards_percent_as_number: 84.86627527723418 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:07] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 232 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1301, relocating_shards: 2, initializing_shards: 78, unassigned_shards: [21:00:07] layed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4812, active_shards_percent_as_number: 84.86627527723418 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:33] (03PS4) 10Jasmine: sophroid: add discovery record [dns] - 
10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) [21:01:05] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 760, active_shards: 1312, relocating_shards: 2, initializing_shards: 76, unassigned_shards: 145, delayed_unassigned_shards: 0, number_of_pending_task [21:01:05] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:01:05] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 760, active_shards: 1312, relocating_shards: 2, initializing_shards: 78, unassigned_shards: 143, delayed_unassigned_shards: 0, number_of_pending_task [21:01:05] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:01:05] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 760, active_shards: 1312, relocating_shards: 2, initializing_shards: 78, unassigned_shards: 143, delayed_unassigned_shards: 0, number_of_pending_task [21:01:05] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:01:07] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1312, relocating_shards: 2, initializing_shards: 78, unassigned_shards: 143, delayed_unassigned [21:01:08] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:01:08] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1312, relocating_shards: 2, initializing_shards: 78, unassigned_shards: 143, delayed_unassigned [21:01:08] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:01:16] (03CR) 10CI reject: [V:04-1] sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [21:01:53] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:02:15] (03CR) 10CDanis: mwscript-k8s: add --output-file flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [21:02:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277747 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [21:02:57] (03PS5) 10Jasmine: sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) [21:03:53] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:04:25] (03CR) 10Ryan Kemper: [C:03+1] wdqs: enable new discovery intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1277713 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [21:04:39] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:04:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1337, relocating_shards: 24, initializing_shards: 68, unassigned_shards: 128, delayed_unassigne [21:04:45] : 0, number_of_pending_tasks: 57, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 86151, active_shards_percent_as_number: 87.21461187214612 https://wikitech.wikimedia.org/wiki/Search%23Administration 
[21:07:38] Deploying security patches for T422306 [21:08:36] (03CR) 10Ssingh: [C:03+1] sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [21:08:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [21:09:39] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:39] !log jasmine@deploy1003 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sophroid [21:13:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [21:13:53] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:14:18] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:14:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host ganeti1055.eqiad.wmnet with OS bookworm [21:14:25] (03CR) 10Jasmine: [C:03+2] sophroid: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/1277743 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [21:14:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863208 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host ganeti1055.eqiad.wmnet with OS bookworm completed: - ganeti... [21:14:48] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti1056.eqiad.wmnet with OS bookworm [21:14:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863210 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host ganeti1056.eqiad.wmnet with OS bookworm [21:14:57] !log jasmine@dns1004 START - running authdns-update [21:16:33] !log jasmine@dns1004 END - running authdns-update [21:18:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:14] (03CR) 10RLazarus: mwscript-k8s: add --output-file flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [21:25:49] preparing to deploy another security patch [21:26:33] (03CR) 10ArielGlenn: [C:03+1] rest-gateway: switch to Apr2026 rate limit policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [21:26:47] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#11863247 (10Jgreen) >>! 
In T196336#11756366, @Dzahn wrote: > This issue is about **passive checks**. > > The historic reason fundraising-tech use... [21:27:06] !log Deployed security fix for T422306 [21:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:00] running scap [21:29:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:29:46] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1056.eqiad.wmnet with reason: host reimage [21:29:51] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2148.codfw.wmnet - https://phabricator.wikimedia.org/T424309#11863255 (10Jhancock.wm) a:03Jhancock.wm [21:30:14] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2148.codfw.wmnet - https://phabricator.wikimedia.org/T424309#11863259 (10Jhancock.wm) 05Open→03Resolved [21:34:10] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177#11863282 (10Jhancock.wm) 05Open→03Resolved [21:34:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:35:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1056.eqiad.wmnet with reason: host reimage [21:36:04] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ms-backup2001 & ms-backup2002 - https://phabricator.wikimedia.org/T422852#11863297 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [21:38:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be200[1-2].codfw.wmnet - https://phabricator.wikimedia.org/T423584#11863322 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [21:39:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 
is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:39:46] !log Deployed security fix for T422676 [21:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:45:27] (03PS7) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [21:49:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:53:58] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:54:22] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:54:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1056.eqiad.wmnet with OS bookworm [21:54:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863404 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host ganeti1056.eqiad.wmnet with OS bookworm completed: - ganeti... 
[21:54:42] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti1057.eqiad.wmnet with OS bookworm [21:54:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863405 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host ganeti1057.eqiad.wmnet with OS bookworm [22:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 5d 15h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [22:07:55] (03PS8) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [22:09:06] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1057.eqiad.wmnet with reason: host reimage [22:09:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:16:01] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1057.eqiad.wmnet with reason: host reimage [22:16:19] (03CR) 10Bartosz Dziewoński: "I'd like to deploy this next week, after the dependency (change 1271967) has rolled out. 
Since it's a bit more complex than the usual conf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) (owner: 10Bartosz Dziewoński) [22:24:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:31:47] 06SRE: API rate limit triggered for regular user - https://phabricator.wikimedia.org/T424588#11863498 (10A_smart_kitten) (tagging #sre; not immediately sure where this specific rate limit may be originating from) [22:32:51] (03CR) 10Anne Tomasevich: "LGTM but not voting yet since this needs to wait to be merged and deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277667 (https://phabricator.wikimedia.org/T421776) (owner: 10Stoyofuku-wmf) [22:34:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:34:39] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:34:42] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [22:35:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [22:35:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1057.eqiad.wmnet with OS bookworm [22:35:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: 
Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host ganeti1057.eqiad.wmnet with OS bookworm completed: - ganeti... [22:36:31] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti1058.eqiad.wmnet with OS bookworm [22:36:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863512 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host ganeti1058.eqiad.wmnet with OS bookworm [22:38:24] 06SRE, 10WMF-General-or-Unknown: API rate limit triggered for regular user - https://phabricator.wikimedia.org/T424588#11863514 (10XXBlackburnXx) [22:38:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [22:43:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [22:44:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:49:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:50:04] (03PS1) 10Arlolra: Deploy PRV to 10 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 
(https://phabricator.wikimedia.org/T424590) [22:51:29] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1058.eqiad.wmnet with reason: host reimage [22:52:08] (03PS1) 10Dzahn: cache: add normal caching setting for zuul.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1277771 (https://phabricator.wikimedia.org/T395938) [22:53:03] (03CR) 10Dzahn: [C:03+2] cache: add normal caching setting for zuul.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1277771 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:54:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:00:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1058.eqiad.wmnet with reason: host reimage [23:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260427T2300) [23:04:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:05:21] (03PS9) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [23:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:11:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [23:14:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:18:02] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:21:08] vriley@cumin1003 reimage (PID 1761922) is awaiting input [23:24:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:25:08] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 266 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 814, active_shards: 1366, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [23:25:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:25:08] layed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.70098039215686 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:25:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1058.eqiad.wmnet with OS bookworm [23:25:10] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 266 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 814, active_shards: 1366, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [23:25:10] 
layed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.70098039215686 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:25:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863634 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host ganeti1058.eqiad.wmnet with OS bookworm completed: - ganeti... [23:25:23] (03PS10) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [23:25:46] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f801c2ed550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [23:25:46] dia.org/wiki/Search%23Administration [23:25:46] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7ff2548ad550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [23:25:46] dia.org/wiki/Search%23Administration [23:26:06] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, 
number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 822, active_shards: 1397, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 247, delayed_unassigned_shards: [23:26:06] r_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 339, active_shards_percent_as_number: 84.61538461538461 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:06] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 822, active_shards: 1397, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 247, delayed_unassigned_shards: [23:26:06] r_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 341, active_shards_percent_as_number: 84.61538461538461 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:06] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 822, active_shards: 1397, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 247, delayed_unassigned_shards: [23:26:06] r_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 338, active_shards_percent_as_number: 84.61538461538461 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:08] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, 
discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 814, active_shards: 1412, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 210, delayed_unassigned [23:26:08] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.51960784313727 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:08] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 252 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 822, active_shards: 1399, relocating_shards: 0, initializing_shards: 10, unassigned_shard [23:26:08] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.73652331920049 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:10] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 814, active_shards: 1412, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 210, delayed_unassigned [23:26:10] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.51960784313727 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:10] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 251 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, 
status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 822, active_shards: 1400, relocating_shards: 0, initializing_shards: 9, unassigned_shards [23:26:10] elayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 64, active_shards_percent_as_number: 84.79709267110842 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [23:27:06] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 822, active_shards: 1450, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pending_ [23:27:06] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.82556026650515 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:27:06] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 822, active_shards: 1450, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pending_ [23:27:06] , number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.82556026650515 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:27:06] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 822, active_shards: 1450, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pending_ [23:27:06] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.82556026650515 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:27:08] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 822, active_shards: 1450, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 191, delayed_unassi [23:27:08] rds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.82556026650515 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:27:10] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 822, active_shards: 1451, relocating_shards: 0, initializing_shards: 9, unassigned_shards: 191, delayed_unassig [23:27:10] ds: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 
0, task_max_waiting_in_queue_millis: 63, active_shards_percent_as_number: 87.88612961841308 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:28:48] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f376ebcd550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [23:28:48] dia.org/wiki/Search%23Administration [23:28:48] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd341da5550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [23:28:48] dia.org/wiki/Search%23Administration [23:30:25] FIRING: SystemdUnitFailed: opensearch_2@cloudelastic-psi-eqiad.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:50] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd1db0cd550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [23:31:50] dia.org/wiki/Search%23Administration [23:31:50] RECOVERY - OpenSearch health 
check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1570, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 79, delayed_unass [23:31:50] ards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.0938824954573 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:28] ACKNOWLEDGEMENT - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King T422860 - The acknowledgement expires at: 2026-04-30 23:31:55. https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:28] ACKNOWLEDGEMENT - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King T422860 - The acknowledgement expires at: 2026-04-30 23:31:55. https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:28] ACKNOWLEDGEMENT - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King T422860 - The acknowledgement expires at: 2026-04-30 23:31:55. https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:28] ACKNOWLEDGEMENT - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King T422860 - The acknowledgement expires at: 2026-04-30 23:31:55. 
https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:28] ACKNOWLEDGEMENT - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King T422860 - The acknowledgement expires at: 2026-04-30 23:31:55. https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:28] ACKNOWLEDGEMENT - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King T422860 - The acknowledgement expires at: 2026-04-30 23:31:55. https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:50] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:33:18] ACKNOWLEDGEMENT - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 Brian_King T422860 - The acknowledgement expires at: 2026-04-30 23:33:10. 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:34:04] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:35:32] RESOLVED: [2x] SystemdUnitFailed: opensearch_2@cloudelastic-psi-eqiad.servic.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:36:25] FIRING: [2x] SystemdUnitFailed: opensearch_2@cloudelastic-psi-eqiad.servic.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:27] (03PS1) 10Zabe: Undeploy GoogleNewsSitemap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277783 (https://phabricator.wikimedia.org/T421798) [23:37:30] (03CR) 10CDanis: mwscript-k8s: add --output-file flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [23:39:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277784 [23:39:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277784 (owner: 10TrainBranchBot) [23:46:00] (03PS11) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [23:46:25] RESOLVED: [2x] SystemdUnitFailed: opensearch_2@cloudelastic-psi-eqiad.servic.service on 
cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11863737 (10VRiley-WMF) [23:49:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:51:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277784 (owner: 10TrainBranchBot) [23:51:25] FIRING: [2x] SystemdUnitFailed: opensearch_2@cloudelastic-psi-eqiad.servic.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:52:45] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1623, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 9, delayed_unassigned [23:52:45] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.44852941176471 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:52:49] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:53:49] RESOLVED: HelmReleaseBadStatus: 
Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:54:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired