[00:05:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb2013.codfw.wmnet with OS trixie
[00:05:18] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11853529 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb2013.codfw.wmnet with OS trixie execu...
[00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:38] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:59:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:02:03] jhancock@cumin2002 reimage (PID 3625297) is awaiting input
[01:10:03] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276841
[01:10:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276841 (owner: TrainBranchBot)
[01:21:55] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276841 (owner: TrainBranchBot)
[01:24:38] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:29:38] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:54:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:59:38] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:00:34] (CR) Andrew Bogott: [C:+2] setup_capi.sh.erb: update to resemble upstream guides for magnum-capi [puppet] - https://gerrit.wikimedia.org/r/1276820 (owner: Andrew Bogott)
[02:00:42] (CR) Andrew Bogott: [C:+2] Magnum: switch codfw1dev from capi-helm to magnum-cluster-api driver [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:01:05] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:38] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 32s)
[02:14:38] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:36:25] (PS1) Andrew Bogott: magnum cluster-api: include k8s config file on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1276844 (https://phabricator.wikimedia.org/T393782)
[02:36:54] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276844 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:39:11] (CR) Andrew Bogott: [C:+2] magnum cluster-api: include k8s config file on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1276844 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:48:24] FIRING: [14x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:59:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T410589)', diff saved to https://phabricator.wikimedia.org/P91380 and previous config saved to /var/cache/conftool/dbconfig/20260424-025930-ladsgroup.json
[02:59:35] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[03:01:59] (PS1) Andrew Bogott: magnum: update capi worker build process [puppet] - https://gerrit.wikimedia.org/r/1276845
[03:03:34] (CR) Andrew Bogott: [C:+2] magnum: update capi worker build process [puppet] - https://gerrit.wikimedia.org/r/1276845 (owner: Andrew Bogott)
[03:04:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1019:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:04:51] FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:05:09] FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:09:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1019:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:09:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P91381 and previous config saved to /var/cache/conftool/dbconfig/20260424-030938-ladsgroup.json
[03:13:47] (PS1) Andrew Bogott: magnum: fully/qualify path to kubectl in unless clause [puppet] - https://gerrit.wikimedia.org/r/1276846
[03:14:34] (CR) Andrew Bogott: [C:+2] magnum: fully/qualify path to kubectl in unless clause [puppet] - https://gerrit.wikimedia.org/r/1276846 (owner: Andrew Bogott)
[03:19:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1019:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:19:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P91382 and previous config saved to /var/cache/conftool/dbconfig/20260424-031947-ladsgroup.json
[03:29:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T410589)', diff saved to https://phabricator.wikimedia.org/P91383 and previous config saved to /var/cache/conftool/dbconfig/20260424-032955-ladsgroup.json
[03:30:01] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[03:30:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[03:30:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T410589)', diff saved to https://phabricator.wikimedia.org/P91384 and previous config saved to /var/cache/conftool/dbconfig/20260424-033021-ladsgroup.json
[03:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:39:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1019:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:54:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:33] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:30:40] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2389.40 ms
[04:31:16] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[04:34:33] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:39:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:54:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:56:12] (PS1) Ayounsi: Add netflow5003 to the kafka brokers ACL [puppet] - https://gerrit.wikimedia.org/r/1276854 (https://phabricator.wikimedia.org/T421863)
[05:19:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:22:46] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11853985 (ayounsi) @bking the data needs to be manually re-created by copying the data in the links I shared previously. In that case it...
[05:26:44] (PS1) Marostegui: db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276856 (https://phabricator.wikimedia.org/T424309)
[05:28:04] (CR) Marostegui: [C:+2] db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276856 (https://phabricator.wikimedia.org/T424309) (owner: Marostegui)
[05:29:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:29:45] (PS1) Marostegui: instances.yaml: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1276857 (https://phabricator.wikimedia.org/T424309)
[05:32:18] (CR) Marostegui: [C:+2] instances.yaml: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1276857 (https://phabricator.wikimedia.org/T424309) (owner: Marostegui)
[05:33:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2148 from dbctl T424309', diff saved to https://phabricator.wikimedia.org/P91385 and previous config saved to /var/cache/conftool/dbconfig/20260424-053342-marostegui.json
[05:33:47] T424309: decommission db2148.codfw.wmnet - https://phabricator.wikimedia.org/T424309
[05:34:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:36:58] ops-codfw, SRE, DBA, DC-Ops, decommission-hardware: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177#11854031 (Marostegui)
[05:39:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:40:02] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 19165
[05:40:31] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 19165
[05:41:29] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 20940
[05:44:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 20940
[05:48:08] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 58717
[05:48:58] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1276854 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[05:49:02] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58717
[05:49:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:53:36] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 264595
[05:54:02] (CR) Ayounsi: [C:+2] Add netflow5003 to the kafka brokers ACL [puppet] - https://gerrit.wikimedia.org/r/1276854 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[05:56:43] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264595
[05:56:53] (PS1) Muehlenhoff: Add hcaptcha-proxy5003/5004 [puppet] - https://gerrit.wikimedia.org/r/1276859 (https://phabricator.wikimedia.org/T421863)
[05:57:17] (PS1) Ayounsi: netflow5003: apply role netinsights [puppet] - https://gerrit.wikimedia.org/r/1276862 (https://phabricator.wikimedia.org/T421863)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260424T0600)
[06:06:58] (PS1) Muehlenhoff: Assign netinsights role for netflow5003 [puppet] - https://gerrit.wikimedia.org/r/1276865 (https://phabricator.wikimedia.org/T421863)
[06:08:03] (CR) Muehlenhoff: [C:+2] Assign netinsights role for netflow5003 [puppet] - https://gerrit.wikimedia.org/r/1276865 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:09:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:13:41] (PS1) Marostegui: installserver: Do not format pc2022 [puppet] - https://gerrit.wikimedia.org/r/1276866
[06:18:21] (CR) Marostegui: [C:+2] installserver: Do not format pc2022 [puppet] - https://gerrit.wikimedia.org/r/1276866 (owner: Marostegui)
[06:24:37] FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:24:42] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:28:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db[2142-2143].codfw.wmnet with reason: Cloning
[06:29:38] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:36:18] (CR) Muehlenhoff: [C:+1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/1276674 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[06:37:42] (CR) Ayounsi: [C:+2] eqsin: update netflow collector IP [homer/public] - https://gerrit.wikimedia.org/r/1276674 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[06:38:17] (CR) Ayounsi: [C:+1] Add hcaptcha-proxy5003/5004 [puppet] - https://gerrit.wikimedia.org/r/1276859 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:39:02] (Merged) jenkins-bot: eqsin: update netflow collector IP [homer/public] - https://gerrit.wikimedia.org/r/1276674 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[06:39:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:48:24] FIRING: [14x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:58:01] (CR) Elukey: [C:+2] sre.hosts.provision: make UncoreFrequency dynamic for iDRAC 10 [cookbooks] - https://gerrit.wikimedia.org/r/1275889 (https://phabricator.wikimedia.org/T418899) (owner: Elukey)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260424T0700)
[07:00:21] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8459/co" [puppet] - https://gerrit.wikimedia.org/r/1276745 (https://phabricator.wikimedia.org/T423723) (owner: Herron)
[07:01:21] (CR) Ryan Kemper: [C:+1] cloudelastic: set role-level hiera for OpenSearch 2/Trixie [puppet] - https://gerrit.wikimedia.org/r/1276818 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[07:03:00] (CR) Elukey: [V:+1 C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1276745 (https://phabricator.wikimedia.org/T423723) (owner: Herron)
[07:04:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:05:09] FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:06:44] (CR) JavierMonton: alert: mw-page-html-content-change-enrich (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) (owner: JavierMonton)
[07:09:38] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:19:21] (CR) Phuedx: [C:+1] Add script to get constructive edits for all wikis [puppet] - https://gerrit.wikimedia.org/r/1272633 (https://phabricator.wikimedia.org/T422736) (owner: Clare Ming)
[07:21:06] (PS1) Elukey: services: enable ingress for evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193)
[07:21:28] (CR) Muehlenhoff: [C:+2] Add hcaptcha-proxy5003/5004 [puppet] - https://gerrit.wikimedia.org/r/1276859 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:24:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:28:39] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 4670.20 ms
[07:29:31] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[07:31:54] (PS2) Elukey: services: enable ingress for evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193)
[07:31:54] (PS1) Elukey: charts: add ingress support to function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/1276873 (https://phabricator.wikimedia.org/T424193)
[07:32:50] (PS1) Ayounsi: eqsin conftool: remove decom tcp-proxy [puppet] - https://gerrit.wikimedia.org/r/1276874 (https://phabricator.wikimedia.org/T421863)
[07:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:35:43] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1276874 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[07:36:52] (CR) Ayounsi: [C:+2] eqsin conftool: remove decom tcp-proxy [puppet] - https://gerrit.wikimedia.org/r/1276874 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[07:37:30] (CR) Ayounsi: [C:+2] "Deploying as I think it's breaking Puppet on prometheus5002:" [puppet] - https://gerrit.wikimedia.org/r/1276874 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[07:38:26] (PS1) Muehlenhoff: profile::syslog::centralserver [puppet] - https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204)
[07:38:55] (CR) CI reject: [V:-1] profile::syslog::centralserver [puppet] - https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: Muehlenhoff)
[07:39:00] (Abandoned) Ayounsi: Remove tcp-proxy5001/5002 from conftool [puppet] - https://gerrit.wikimedia.org/r/1276709 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:39:02] (CR) Filippo Giunchedi: Add upstream repos for openstack flamingo and gazpacho (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: Andrew Bogott)
[07:40:16] (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: Andrew Bogott)
[07:41:11] (PS6) Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993)
[07:41:20] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[07:42:07] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[07:43:11] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11854206 (Marostegui) @Jclark-ctr could we get this replaced before the weekend? Thanks!
[07:45:06] (PS7) Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993)
[07:45:16] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[07:45:37] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts netflow5002.eqsin.wmnet
[07:45:55] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts netflow5002.eqsin.wmnet
[07:45:59] (PS2) Muehlenhoff: profile::syslog::centralserver [puppet] - https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204)
[07:46:29] (PS7) Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993)
[07:46:34] (PS1) Gerrit maintenance bot: mariadb: Promote db1201 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1276877 (https://phabricator.wikimedia.org/T424315)
[07:46:38] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[07:46:40] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1276878 (https://phabricator.wikimedia.org/T424315)
[07:46:50] (PS1) Ayounsi: Remove netflow5002 from kafka broker ACL [puppet] - https://gerrit.wikimedia.org/r/1276879 (https://phabricator.wikimedia.org/T421863)
[07:47:16] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1276879 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[07:48:11] (CR) Ayounsi: [C:+2] Remove netflow5002 from kafka broker ACL [puppet] - https://gerrit.wikimedia.org/r/1276879 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[07:50:34] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts netflow5002.eqsin.wmnet
[07:51:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[07:51:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T419961)', diff saved to https://phabricator.wikimedia.org/P91386 and previous config saved to /var/cache/conftool/dbconfig/20260424-075145-fceratto.json
[07:52:46] (Abandoned) Muehlenhoff: netflow5003: apply role netinsights [puppet] - https://gerrit.wikimedia.org/r/1276862 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[07:54:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy5003.wikimedia.org
[07:54:38] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:59:18] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:59:19] (PS1) Ilias Sarantopoulos: Add gRPC support to Istio ingress gateway for ML services [deployment-charts] - https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T423582)
[08:00:00] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:00:24] jmm@cumin2002 makevm (PID 3939294) is awaiting input
[08:00:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T419961)', diff saved to https://phabricator.wikimedia.org/P91388 and previous config saved to /var/cache/conftool/dbconfig/20260424-080025-fceratto.json
[08:01:00] (CR) Klausman: [V:+1 C:+2] manifests/hiera: Move ml-serve101[45] to k8s worker role [puppet] - https://gerrit.wikimedia.org/r/1275814 (owner: Klausman)
[08:03:09] (PS2) Ilias Sarantopoulos: Add gRPC support to Istio ingress gateway for ML services [deployment-charts] - https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049)
[08:03:54] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003"
[08:04:33] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:04:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:05:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003"
[08:05:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:05:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow5002.eqsin.wmnet
[08:05:18] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11854281 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1003 for hosts: `netflow5002.eqsin.wmnet` - netflow5002.eqsin.wmnet (**PA...
[08:05:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:04] (PS3) Muehlenhoff: profile::syslog::centralserver: Readd acme support [puppet] - https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204)
[08:08:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:08:11] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5003.wikimedia.org on all recursors
[08:08:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5003.wikimedia.org on all recursors
[08:08:25] FIRING: SystemdUnitFailed: dragonfly-dfdaemon.service on ml-serve1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:38] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:08:43] SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11854297 (atsuko) a:MMiller_WMF→atsuko
[08:08:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:08:56] (PS2) Atsuko: admin: Add jmoore111 to the analytics-admin group [puppet] - https://gerrit.wikimedia.org/r/1275826 (https://phabricator.wikimedia.org/T422963)
[08:09:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:10:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P91390 and previous config saved to /var/cache/conftool/dbconfig/20260424-081033-fceratto.json
[08:11:26] (CR) Atsuko: [C:+2] admin: Add jmoore111 to the analytics-admin group [puppet] - https://gerrit.wikimedia.org/r/1275826 (https://phabricator.wikimedia.org/T422963) (owner: Atsuko)
[08:12:00] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1014:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:12:25] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002"
[08:12:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002"
[08:12:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:12:31] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5003.wikimedia.org on all recursors
[08:12:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5003.wikimedia.org on all recursors
[08:12:42] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy5003.wikimedia.org
[08:13:25] RESOLVED: SystemdUnitFailed: dragonfly-dfdaemon.service on ml-serve1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:22] SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11854302 (atsuko) In progress→Resolved merged, will roll-out within an hour on periodic puppet run
[08:14:27] (PS1) Klausman: manifests/amd_gpu: Bump trixie firmwware package version [puppet] - https://gerrit.wikimedia.org/r/1277046
[08:14:52] SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17): Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11854304 (atsuko)
[08:14:53] (PS2) Klausman: manifests/amd_gpu: Bump trixie firmwware package version [puppet] - https://gerrit.wikimedia.org/r/1277046
[08:15:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[08:15:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T419635)', diff saved to https://phabricator.wikimedia.org/P91391 and previous config saved to /var/cache/conftool/dbconfig/20260424-081539-fceratto.json
[08:15:43] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:16:09] (CR) Elukey: [C:+1] manifests/amd_gpu: Bump trixie firmwware package version [puppet] - https://gerrit.wikimedia.org/r/1277046 (owner: Klausman)
[08:16:43] (CR) Klausman: [V:+2 C:+2] manifests/amd_gpu: Bump trixie firmwware package version [puppet] - https://gerrit.wikimedia.org/r/1277046 (owner: Klausman)
[08:17:00] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1014:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:17:39] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: Muehlenhoff)
[08:18:31] (PS8) Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993)
[08:18:31] (PS8) Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993)
[08:18:52] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:19:01] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:19:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy5003.wikimedia.org
[08:19:19] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:19:54] (CR) CI reject: [V:-1] profile::pki::get_cert: add lookup() to the label argument [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:20:25] FIRING: SystemdUnitFailed: dragonfly-dfdaemon.service on ml-serve1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P91392 and previous config saved to /var/cache/conftool/dbconfig/20260424-082041-fceratto.json
[08:22:06] !log klausman@cumin1003
START - Cookbook sre.hosts.reboot-single for host ml-serve1014.eqiad.wmnet [08:23:30] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317 (10ArthurTaylor) 03NEW [08:24:05] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002" [08:24:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002" [08:24:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:24:11] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5003.wikimedia.org on all recursors [08:24:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5003.wikimedia.org on all recursors [08:24:18] RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:24:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:24:59] (03CR) 10Elukey: [C:04-1] "Need to rework this, since removing the "discovery" explicit bit means moving the function call to named parameters. 
Given the fact that w" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:25:25] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:11] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1014.eqiad.wmnet [08:28:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11854340 (10karapayneWMDE) Approved on my side! [08:29:30] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002" [08:29:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002" [08:29:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:29:37] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5003.wikimedia.org on all recursors [08:29:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5003.wikimedia.org on all recursors [08:29:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy5003.wikimedia.org [08:30:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T419961)', diff saved to https://phabricator.wikimedia.org/P91393 and previous config saved to /var/cache/conftool/dbconfig/20260424-083050-fceratto.json [08:31:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [08:31:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T419961)', diff saved to https://phabricator.wikimedia.org/P91394 and previous config saved to /var/cache/conftool/dbconfig/20260424-083118-fceratto.json [08:34:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T419635)', diff saved to https://phabricator.wikimedia.org/P91395 and previous config saved to /var/cache/conftool/dbconfig/20260424-083406-fceratto.json [08:34:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:34:31] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, will unbreak puppet on syslog-server-audit*.cloudinfra" [puppet] - 10https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [08:38:08] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11854365 (10MoritzMuehlenhoff) [08:40:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11854370 (10MoritzMuehlenhoff) [08:41:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11854371 (10MoritzMuehlenhoff) [08:42:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T419961)', diff saved to https://phabricator.wikimedia.org/P91396 and previous config saved to /var/cache/conftool/dbconfig/20260424-084213-fceratto.json [08:42:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11854373 (10MoritzMuehlenhoff) [08:43:35] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - 
https://phabricator.wikimedia.org/T421863#11854374 (10ayounsi) [08:44:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P91397 and previous config saved to /var/cache/conftool/dbconfig/20260424-084414-fceratto.json [08:44:38] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:50:25] RESOLVED: SystemdUnitFailed: dragonfly-dfdaemon.service on ml-serve1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P91398 and previous config saved to /var/cache/conftool/dbconfig/20260424-085221-fceratto.json [08:54:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P91399 and previous config saved to /var/cache/conftool/dbconfig/20260424-085421-fceratto.json [08:54:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:56:32] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [09:01:45] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [09:02:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P91400 and previous config saved to /var/cache/conftool/dbconfig/20260424-090229-fceratto.json 
[09:04:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T419635)', diff saved to https://phabricator.wikimedia.org/P91401 and previous config saved to /var/cache/conftool/dbconfig/20260424-090429-fceratto.json [09:04:34] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:04:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [09:04:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1184 (T419635)', diff saved to https://phabricator.wikimedia.org/P91402 and previous config saved to /var/cache/conftool/dbconfig/20260424-090454-fceratto.json [09:11:36] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [09:12:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T419961)', diff saved to https://phabricator.wikimedia.org/P91403 and previous config saved to /var/cache/conftool/dbconfig/20260424-091237-fceratto.json [09:12:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [09:13:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:13:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T419961)', diff saved to https://phabricator.wikimedia.org/P91404 and previous config saved to /var/cache/conftool/dbconfig/20260424-091316-fceratto.json [09:15:28] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-04-24 - 2026-05-15), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11854450 (10Gehel) [09:15:32] 10ops-eqiad, 
06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11854452 (10Gehel) [09:16:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11854470 (10Gehel) [09:16:48] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [09:17:07] 10SRE-SLO, 10observability, 10Wikidata, 06Wikidata Platform Team, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11854486 (10Gehel) [09:17:52] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11854504 (10Gehel) [09:18:18] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11854516 (10Gehel) [09:18:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11854518 (10Gehel) [09:18:37] 06SRE, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11854520 (10Gehel) [09:19:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11854536 (10Gehel) [09:19:19] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [09:19:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11854543 (10Gehel) [09:19:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 
Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11854544 (10Gehel) [09:20:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11854548 (10Gehel) [09:20:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11854549 (10Gehel) [09:21:20] !log cmooney@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1014.eqiad.wmnet [09:21:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T419961)', diff saved to https://phabricator.wikimedia.org/P91405 and previous config saved to /var/cache/conftool/dbconfig/20260424-092135-fceratto.json [09:21:42] (03PS2) 10Arnaudb: gerrit: predict_linear alert for diskspace [alerts] - 10https://gerrit.wikimedia.org/r/1277048 (https://phabricator.wikimedia.org/T423601) [09:21:46] (03CR) 10Arnaudb: "What's returned by the query over the last diskspace incident: https://grafana.wikimedia.org/goto/afk1jrg51730gc?orgId=1" [alerts] - 10https://gerrit.wikimedia.org/r/1277048 (https://phabricator.wikimedia.org/T423601) (owner: 10Arnaudb) [09:23:00] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [09:23:19] (03PS9) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [09:23:19] (03PS9) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [09:23:49] (03CR) 10CI reject: [V:04-1] profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 
(https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:24:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T419635)', diff saved to https://phabricator.wikimedia.org/P91406 and previous config saved to /var/cache/conftool/dbconfig/20260424-092401-fceratto.json [09:24:05] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:24:24] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [09:26:48] (03PS1) 10Muehlenhoff: Apply ncredir role to ncredir5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1277051 (https://phabricator.wikimedia.org/T421863) [09:27:44] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change records for ml-serve1014 - cmooney@cumin1003" [09:27:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change records for ml-serve1014 - cmooney@cumin1003" [09:27:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:28:13] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ml-serve1014.eqiad.wmnet on all recursors [09:28:17] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve1014.eqiad.wmnet on all recursors [09:29:37] FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:29:38] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:31:44] !log fceratto@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P91407 and previous config saved to /var/cache/conftool/dbconfig/20260424-093143-fceratto.json [09:34:03] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ml-serve1014.eqiad.wmnet [09:34:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P91408 and previous config saved to /var/cache/conftool/dbconfig/20260424-093409-fceratto.json [09:37:00] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1014:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [09:40:02] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [09:41:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P91409 and previous config saved to /var/cache/conftool/dbconfig/20260424-094151-fceratto.json [09:42:00] RESOLVED: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1014:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [09:42:57] (03PS10) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [09:42:57] (03PS10) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [09:44:00] (03CR) 10Ayounsi: [C:03+1] Apply ncredir role to ncredir5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1277051 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [09:44:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P91410 and previous config saved to /var/cache/conftool/dbconfig/20260424-094417-fceratto.json [09:44:30] (03CR) 10CI reject: [V:04-1] profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:49:38] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:49:42] (03PS11) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [09:49:42] (03PS11) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [09:50:52] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [09:51:19] (03CR) 10CI reject: [V:04-1] 
profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:51:50] FIRING: KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1015.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:52:00] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [09:52:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T419961)', diff saved to https://phabricator.wikimedia.org/P91411 and previous config saved to /var/cache/conftool/dbconfig/20260424-095159-fceratto.json [09:52:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [09:52:22] !log cmooney@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [09:52:24] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [09:52:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T419961)', diff saved to https://phabricator.wikimedia.org/P91412 and previous config saved to /var/cache/conftool/dbconfig/20260424-095228-fceratto.json [09:54:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T419635)', diff saved to https://phabricator.wikimedia.org/P91413 and previous config saved to /var/cache/conftool/dbconfig/20260424-095425-fceratto.json [09:54:30] T419635: Drop il_to column from imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T419635 [09:54:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [09:54:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T419635)', diff saved to https://phabricator.wikimedia.org/P91414 and previous config saved to /var/cache/conftool/dbconfig/20260424-095450-fceratto.json [09:56:43] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change records for ml-serve1014 - cmooney@cumin1003" [09:56:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change records for ml-serve1014 - cmooney@cumin1003" [09:56:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:56:59] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ml-serve1015.eqiad.wmnet on all recursors [09:57:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve1015.eqiad.wmnet on all recursors [09:57:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [09:57:20] RESOLVED: KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1015.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:58:08] PROBLEM - Host ml-serve1015 is DOWN: PING CRITICAL - Packet loss = 100% [09:59:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1019:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:59:50] FIRING: KubernetesCalicoDown: 
ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1015.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:00:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T419961)', diff saved to https://phabricator.wikimedia.org/P91415 and previous config saved to /var/cache/conftool/dbconfig/20260424-100047-fceratto.json [10:00:57] 07sre-alert-triage, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#11854744 (10Gehel) [10:02:00] RESOLVED: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:02:25] FIRING: [5x] SystemdUnitFailed: cadvisor.service on ml-serve1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:46] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T424175 [10:04:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1019:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:04:50] RESOLVED: KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1015.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:07:25] RESOLVED: [5x] SystemdUnitFailed: cadvisor.service on ml-serve1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:00] RECOVERY - Host ml-serve1015 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [10:11:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P91416 and previous config saved to /var/cache/conftool/dbconfig/20260424-101056-fceratto.json [10:11:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T419635)', diff saved to https://phabricator.wikimedia.org/P91417 and previous config saved to /var/cache/conftool/dbconfig/20260424-101146-fceratto.json [10:11:51] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:12:02] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [10:15:30] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [10:17:37] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [10:20:12] (03PS1) 10Marostegui: db1159,db2228: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277060 (https://phabricator.wikimedia.org/T424323) [10:20:56] (03CR) 10Marostegui: [C:03+2] db1159,db2228: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277060 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [10:21:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to 
https://phabricator.wikimedia.org/P91418 and previous config saved to /var/cache/conftool/dbconfig/20260424-102108-fceratto.json [10:21:14] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [10:21:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1159.eqiad.wmnet with reason: Reimage to Trixie [10:21:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1159: Reimage to Trixie [10:21:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2228.codfw.wmnet with reason: Reimage to Trixie [10:21:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2228: Reimage to Trixie [10:21:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1159: Reimage to Trixie [10:21:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2228: Reimage to Trixie [10:21:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P91421 and previous config saved to /var/cache/conftool/dbconfig/20260424-102154-fceratto.json [10:22:50] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1159.eqiad.wmnet with OS trixie [10:22:57] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2228.codfw.wmnet with OS trixie [10:26:44] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [10:29:36] (03PS1) 10Filippo Giunchedi: kubeadm: quote kubectl arguments [puppet] - 10https://gerrit.wikimedia.org/r/1277065 (https://phabricator.wikimedia.org/T420565) [10:29:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1017:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:30:15] (03CR) 10CI reject: [V:04-1] kubeadm: quote kubectl arguments [puppet] - 10https://gerrit.wikimedia.org/r/1277065 (https://phabricator.wikimedia.org/T420565) (owner: 10Filippo Giunchedi) [10:30:19] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [10:31:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T419961)', diff saved to https://phabricator.wikimedia.org/P91422 and previous config saved to /var/cache/conftool/dbconfig/20260424-103116-fceratto.json [10:31:26] (03PS2) 10Filippo Giunchedi: kubeadm: quote kubectl arguments [puppet] - 10https://gerrit.wikimedia.org/r/1277065 (https://phabricator.wikimedia.org/T420565) [10:31:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [10:31:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T419961)', diff saved to https://phabricator.wikimedia.org/P91423 and previous config saved to /var/cache/conftool/dbconfig/20260424-103146-fceratto.json [10:32:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P91424 and previous config saved to /var/cache/conftool/dbconfig/20260424-103202-fceratto.json [10:32:27] (03PS3) 10JavierMonton: alert: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) [10:32:38] (03CR) 10Filippo Giunchedi: "CI weeps about the long commit, though this is the upstream commit link https://github.com/postfinance/kubectl-sudo/commit/1061a7fde18508f" [puppet] - 10https://gerrit.wikimedia.org/r/1277065 (https://phabricator.wikimedia.org/T420565) (owner: 10Filippo Giunchedi) [10:33:51] (03CR) 10CI reject: [V:04-1] alert: mw-page-html-content-change-enrich 
[alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [10:34:09] (03PS2) 10JavierMonton: alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) [10:35:29] (03CR) 10CI reject: [V:04-1] alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [10:36:28] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1159.eqiad.wmnet with reason: host reimage [10:38:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2228.codfw.wmnet with reason: host reimage [10:40:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T419961)', diff saved to https://phabricator.wikimedia.org/P91426 and previous config saved to /var/cache/conftool/dbconfig/20260424-104016-fceratto.json [10:41:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1159.eqiad.wmnet with reason: host reimage [10:42:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T419635)', diff saved to https://phabricator.wikimedia.org/P91427 and previous config saved to /var/cache/conftool/dbconfig/20260424-104210-fceratto.json [10:42:16] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:42:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1195.eqiad.wmnet with reason: Maintenance [10:42:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T419635)', diff saved to https://phabricator.wikimedia.org/P91428 and previous config saved to /var/cache/conftool/dbconfig/20260424-104235-fceratto.json [10:48:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on db2228.codfw.wmnet with reason: host reimage [10:48:24] FIRING: [14x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:50:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P91429 and previous config saved to /var/cache/conftool/dbconfig/20260424-105023-fceratto.json [10:50:47] (03PS4) 10JavierMonton: alert: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) [10:52:05] (03PS3) 10JavierMonton: alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) [10:52:33] (03CR) 10JavierMonton: alert: mw-page-html-content-change-enrich (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [10:54:15] (03CR) 10Majavah: "question inline, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [10:54:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1017:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:54:42] (03CR) 10JavierMonton: [C:03+2] alert: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [10:55:41] (03PS1) 10Marostegui: Revert "db1159,db2228: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277069 [10:55:58] (03Merged) 10jenkins-bot: alert: mw-page-html-content-change-enrich [alerts] - 
10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [10:56:17] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [10:56:42] (03PS7) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [10:59:48] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260424T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab version upgrades . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260424T1100). [11:00:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P91430 and previous config saved to /var/cache/conftool/dbconfig/20260424-110031-fceratto.json [11:01:03] aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[11:01:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T419635)', diff saved to https://phabricator.wikimedia.org/P91431 and previous config saved to /var/cache/conftool/dbconfig/20260424-110125-fceratto.json [11:01:29] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:03:01] (03CR) 10Marostegui: [C:03+2] Revert "db1159,db2228: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277069 (owner: 10Marostegui) [11:05:10] FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:05:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1159.eqiad.wmnet with OS trixie [11:06:42] (03PS6) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) [11:07:06] (03CR) 10Jcrespo: [C:04-2] "Waiting to test dump and snapshot backups." 
[puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [11:08:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1159: after reimage to trixie [11:10:12] (03CR) 10Jcrespo: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [11:10:29] (03PS7) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) [11:10:37] (03CR) 10Jcrespo: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [11:10:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T419961)', diff saved to https://phabricator.wikimedia.org/P91433 and previous config saved to /var/cache/conftool/dbconfig/20260424-111039-fceratto.json [11:11:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [11:11:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2228.codfw.wmnet with OS trixie [11:11:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T419961)', diff saved to https://phabricator.wikimedia.org/P91434 and previous config saved to /var/cache/conftool/dbconfig/20260424-111108-fceratto.json [11:11:18] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T424175 [11:11:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P91436 and previous config saved to /var/cache/conftool/dbconfig/20260424-111132-fceratto.json [11:12:02] (03PS8) 10Jcrespo: mariadb: Set db2141 as a spare for 
decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) [11:12:29] (03CR) 10Jcrespo: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [11:13:02] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [11:13:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2228: after reimage to trixie [11:16:31] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [11:19:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T419961)', diff saved to https://phabricator.wikimedia.org/P91439 and previous config saved to /var/cache/conftool/dbconfig/20260424-111931-fceratto.json [11:21:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P91440 and previous config saved to /var/cache/conftool/dbconfig/20260424-112141-fceratto.json [11:29:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs2010:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:29:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P91443 and previous config saved to /var/cache/conftool/dbconfig/20260424-112939-fceratto.json [11:31:34] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [11:31:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T419635)', diff saved to https://phabricator.wikimedia.org/P91444 and previous config saved to /var/cache/conftool/dbconfig/20260424-113149-fceratto.json [11:31:54] T419635: Drop il_to column from imagelinks 
table in wmf production - https://phabricator.wikimedia.org/T419635 [11:32:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [11:32:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:32:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T419635)', diff saved to https://phabricator.wikimedia.org/P91445 and previous config saved to /var/cache/conftool/dbconfig/20260424-113235-fceratto.json [11:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:34:28] (03CR) 10Elukey: [C:03+2] admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [11:34:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs2010:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:39:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs2010:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:39:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P91447 and previous config saved to /var/cache/conftool/dbconfig/20260424-113948-fceratto.json [11:41:55] (03PS1) 10PipelineBot: 
wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277075 [11:44:32] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [11:44:38] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs2010:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:49:29] (03CR) 10Elukey: Add gRPC support to Istio ingress gateway for ML services (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [11:49:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T419961)', diff saved to https://phabricator.wikimedia.org/P91450 and previous config saved to /var/cache/conftool/dbconfig/20260424-114956-fceratto.json [11:50:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance [11:50:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T419961)', diff saved to https://phabricator.wikimedia.org/P91451 and previous config saved to /var/cache/conftool/dbconfig/20260424-115025-fceratto.json [11:50:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T419635)', diff saved to https://phabricator.wikimedia.org/P91452 and previous config saved to /var/cache/conftool/dbconfig/20260424-115036-fceratto.json [11:50:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:53:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1159: after reimage to trixie [11:54:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [11:58:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2228: after reimage to trixie [11:58:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T419961)', diff saved to https://phabricator.wikimedia.org/P91455 and previous config saved to /var/cache/conftool/dbconfig/20260424-115845-fceratto.json [11:59:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [12:00:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P91456 and previous config saved to /var/cache/conftool/dbconfig/20260424-120045-fceratto.json [12:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:08:09] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:08:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P91458 and previous config saved to /var/cache/conftool/dbconfig/20260424-120854-fceratto.json [12:09:19] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11855082 (10Jclark-ctr) Replaced Failed drive ` Device Description Disk 5 in 
Backplane 1 of Integrated RAID Controller 1 Operational State Rebuilding ` [12:10:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P91460 and previous config saved to /var/cache/conftool/dbconfig/20260424-121053-fceratto.json [12:13:42] (03PS3) 10Ilias Sarantopoulos: Add gRPC support to Istio ingress gateway for ML services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) [12:16:09] (03PS8) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [12:17:05] (03CR) 10Ilias Sarantopoulos: Add gRPC support to Istio ingress gateway for ML services (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [12:17:22] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:17:27] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [12:17:32] (03CR) 10Daniel Kinzler: [C:04-1] "The diff still looks odd. I think it's just grouping the changes in an unintuitive way, but it's worth a closer look before deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [12:17:47] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [12:17:50] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. 
[12:18:13] (03PS9) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [12:19:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P91461 and previous config saved to /var/cache/conftool/dbconfig/20260424-121902-fceratto.json [12:21:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T419635)', diff saved to https://phabricator.wikimedia.org/P91462 and previous config saved to /var/cache/conftool/dbconfig/20260424-122100-fceratto.json [12:21:05] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:21:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance [12:21:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T419635)', diff saved to https://phabricator.wikimedia.org/P91463 and previous config saved to /var/cache/conftool/dbconfig/20260424-122125-fceratto.json [12:21:54] (03CR) 10Elukey: Add gRPC support to Istio ingress gateway for ML services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [12:24:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11855135 (10Marostegui) Thank you so much [12:26:12] (03PS1) 10Marostegui: db1185,db2223: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277081 (https://phabricator.wikimedia.org/T424323) [12:26:54] (03CR) 10Marostegui: [C:03+2] db1185,db2223: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1277081 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [12:27:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 5:00:00 on db1185.eqiad.wmnet with reason: Reimage to Trixie [12:27:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2223.codfw.wmnet with reason: Reimage to Trixie [12:27:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1185: Reimage to Trixie [12:27:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2223: Reimage to Trixie [12:27:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1185: Reimage to Trixie [12:27:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2223: Reimage to Trixie [12:28:41] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2223.codfw.wmnet with OS trixie [12:28:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS trixie [12:29:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T419961)', diff saved to https://phabricator.wikimedia.org/P91466 and previous config saved to /var/cache/conftool/dbconfig/20260424-122910-fceratto.json [12:29:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [12:29:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T419961)', diff saved to https://phabricator.wikimedia.org/P91467 and previous config saved to /var/cache/conftool/dbconfig/20260424-122939-fceratto.json [12:33:12] (03CR) 10Ilias Sarantopoulos: Add gRPC support to Istio ingress gateway for ML services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [12:34:04] (03PS5) 10Klausman: modules/amd_rocm: Add wrapper around devplugin binary [puppet] - 10https://gerrit.wikimedia.org/r/1277078 
(https://phabricator.wikimedia.org/T420507) [12:34:38] FIRING: [3x] CertAlmostExpired: Certificate for service wdqs2011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:37:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T419635)', diff saved to https://phabricator.wikimedia.org/P91468 and previous config saved to /var/cache/conftool/dbconfig/20260424-123751-fceratto.json [12:37:55] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:39:39] (03PS6) 10Klausman: modules/amd_rocm: Add wrapper around devplugin binary [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) [12:40:39] (03PS7) 10Klausman: modules/amd_rocm: Add wrapper around devplugin binary [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) [12:42:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage [12:43:20] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [12:43:38] (03PS1) 10Muehlenhoff: Add component for forward port of Zookeeper 3.4 [puppet] - 10https://gerrit.wikimedia.org/r/1277085 (https://phabricator.wikimedia.org/T424266) [12:45:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2223.codfw.wmnet with reason: host reimage [12:47:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P91470 and previous config saved to /var/cache/conftool/dbconfig/20260424-124759-fceratto.json [12:49:32] 
10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11855252 (10Jclark-ctr) [12:49:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11855255 (10Jclark-ctr) a:03Jclark-ctr [12:50:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage [12:53:24] (03PS8) 10Klausman: modules/amd_rocm: Add wrapper around devplugin binary [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) [12:53:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11855263 (10Jclark-ctr) [12:54:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2223.codfw.wmnet with reason: host reimage [12:55:32] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [12:58:03] (03CR) 10Elukey: [C:03+1] "Worth a try in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277043 (https://phabricator.wikimedia.org/T424049) (owner: 10Ilias Sarantopoulos) [12:58:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P91471 and previous config saved to /var/cache/conftool/dbconfig/20260424-125807-fceratto.json [12:59:38] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1013:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:00:02] (03CR) 10Elukey: [C:03+1] "One nit and then you are free to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [13:01:17] (03PS9) 10Klausman: modules/amd_rocm: Add wrapper around devplugin binary [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) [13:02:14] (03PS1) 10Marostegui: Revert "db1185,db2223: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277092 [13:02:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11855299 (10Jclark-ctr) https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1271631 It looks like we might need to use a different username for the time being. Luc... [13:04:00] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [13:04:35] (03CR) 10Klausman: [V:03+1] modules/amd_rocm: Add wrapper around devplugin binary (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [13:05:58] (03CR) 10Klausman: [V:03+1 C:03+2] modules/amd_rocm: Add wrapper around devplugin binary [puppet] - 10https://gerrit.wikimedia.org/r/1277078 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [13:08:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T419635)', diff saved to https://phabricator.wikimedia.org/P91472 and previous config saved to /var/cache/conftool/dbconfig/20260424-130815-fceratto.json [13:08:20] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 
[13:08:33] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [13:08:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T419635)', diff saved to https://phabricator.wikimedia.org/P91473 and previous config saved to /var/cache/conftool/dbconfig/20260424-130840-fceratto.json [13:09:55] (03CR) 10Marostegui: [C:03+2] Revert "db1185,db2223: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1277092 (owner: 10Marostegui) [13:11:27] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [13:12:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS trixie [13:14:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1185: after reimage to trixie [13:18:02] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [13:18:41] (03PS1) 10Klausman: modules/amd_rocm: fix wrong permissions on wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/1277097 [13:19:15] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277098 (https://phabricator.wikimedia.org/T423920) [13:19:25] FIRING: SystemdUnitFailed: amd-k8s-device-plugin.service on ml-serve1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2223.codfw.wmnet with OS trixie [13:20:09] (03CR) 10Muehlenhoff: [C:03+2] Add component for forward port of Zookeeper 3.4 [puppet] - 10https://gerrit.wikimedia.org/r/1277085 (https://phabricator.wikimedia.org/T424266) (owner: 10Muehlenhoff) [13:20:27] (03PS2) 
10Klausman: modules/amd_rocm: fix wrong permissions on wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/1277097 [13:21:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2223: after reimage to trixie [13:22:52] (03PS3) 10Elukey: services: enable ingress for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) [13:23:07] (03CR) 10Klausman: [C:03+2] modules/amd_rocm: fix wrong permissions on wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/1277097 (owner: 10Klausman) [13:25:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T419635)', diff saved to https://phabricator.wikimedia.org/P91476 and previous config saved to /var/cache/conftool/dbconfig/20260424-132505-fceratto.json [13:25:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:26:35] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [13:28:17] (03PS1) 10Elukey: wmnet: add new CNAMEs for wikifunctions evaluators [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) [13:29:38] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1013:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:29:52] FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:30:05] (03CR) 10AKhatun: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277098 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [13:31:53] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host ml-serve1015.eqiad.wmnet [13:33:40] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1014.eqiad.wmnet [13:34:38] RESOLVED: [2x] CertAlmostExpired: Certificate for service wdqs2011:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:34:55] RESOLVED: SystemdUnitFailed: amd-k8s-device-plugin.service on ml-serve1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P91479 and previous config saved to /var/cache/conftool/dbconfig/20260424-133513-fceratto.json [13:39:38] FIRING: CertAlmostExpired: Certificate for service wdqs2011:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:40:16] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1014.eqiad.wmnet [13:41:49] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1014.eqiad.wmnet [13:44:38] FIRING: [2x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:45:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P91482 and previous config saved to /var/cache/conftool/dbconfig/20260424-134522-fceratto.json [13:47:05] !log klausman@cumin1003 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1014.eqiad.wmnet [13:50:55] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11855436 (10MoritzMuehlenhoff) [13:54:32] (03PS1) 10Klausman: profile/amd_gpu: Fix wrong unit override [puppet] - 10https://gerrit.wikimedia.org/r/1277103 (https://phabricator.wikimedia.org/T420507) [13:55:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T419635)', diff saved to https://phabricator.wikimedia.org/P91484 and previous config saved to /var/cache/conftool/dbconfig/20260424-135529-fceratto.json [13:55:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:55:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [13:55:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T419635)', diff saved to https://phabricator.wikimedia.org/P91485 and previous config saved to /var/cache/conftool/dbconfig/20260424-135555-fceratto.json [13:57:15] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1277103 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [13:57:46] (03CR) 10Klausman: [V:03+1 C:03+2] profile/amd_gpu: Fix wrong unit override [puppet] - 10https://gerrit.wikimedia.org/r/1277103 (https://phabricator.wikimedia.org/T420507) (owner: 10Klausman) [13:59:43] !log imported zookeeper 3.4.13-6+deb11u1~wmf13u1 into component/zookeeper34 for trixie-wikimedia (forward port of Zookeeper 3.4 from Bullseye to Trixie) [13:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:06] !log imported zookeeper 3.4.13-6+deb11u1~wmf13u1 into component/zookeeper34 for 
trixie-wikimedia (forward port of Zookeeper 3.4 from Bullseye to Trixie) T424266 [14:00:07] (03PS1) 10Marostegui: mariadb: Decommission db2148 [puppet] - 10https://gerrit.wikimedia.org/r/1277105 (https://phabricator.wikimedia.org/T424309) [14:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:11] T424266: Develop a plan for integrating conf200[7-9] - https://phabricator.wikimedia.org/T424266 [14:00:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1185: after reimage to trixie [14:00:35] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1014.eqiad.wmnet [14:01:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2148.codfw.wmnet [14:01:37] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission db2148 [puppet] - 10https://gerrit.wikimedia.org/r/1277105 (https://phabricator.wikimedia.org/T424309) (owner: 10Marostegui) [14:05:36] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [14:05:39] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1014.eqiad.wmnet [14:05:49] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1015.eqiad.wmnet [14:05:52] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277098 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [14:07:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2223: after reimage to trixie [14:07:58] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277098 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [14:09:18] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2148.codfw.wmnet 
decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [14:09:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2148.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [14:09:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:09:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2148.codfw.wmnet [14:09:52] (03PS1) 10Bking: wdqs-test: use new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277109 (https://phabricator.wikimedia.org/T420993) [14:10:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277109 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [14:10:09] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2148.codfw.wmnet - https://phabricator.wikimedia.org/T424309#11855652 (10Marostegui) a:05Marostegui→03None [14:10:18] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2148.codfw.wmnet - https://phabricator.wikimedia.org/T424309#11855656 (10Marostegui) This is ready for DC-Ops [14:11:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:11:26] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:11:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11855670 (10Jclark-ctr) I have re-added the IPs. If anyone is able to verify whether I’m missing anything and or what other steps are need... 
[14:13:13] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:13:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T419635)', diff saved to https://phabricator.wikimedia.org/P91488 and previous config saved to /var/cache/conftool/dbconfig/20260424-141315-fceratto.json [14:13:20] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:14:06] (03PS1) 10Ayounsi: anchor5001 is back online on Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1277111 (https://phabricator.wikimedia.org/T421863) [14:15:31] (03CR) 10Elukey: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1277109 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [14:15:48] (03CR) 10Bking: [C:03+2] wdqs-test: use new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277109 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [14:15:55] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:16:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11855677 (10MoritzMuehlenhoff) [14:16:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11855678 (10Jclark-ctr) secure-cookbook sre.dns.netbox has finished running [14:19:07] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1015.eqiad.wmnet [14:20:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1277111 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [14:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.15% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:23:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P91489 and previous config saved to /var/cache/conftool/dbconfig/20260424-142323-fceratto.json [14:23:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11855729 (10Jclark-ctr) @JMeybohm tools-k8s-ctrl[1001-1002] and tools-k8s-worker[1001-100... [14:24:28] (03CR) 10Ayounsi: [C:03+2] anchor5001 is back online on Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1277111 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [14:25:14] (03CR) 10FNegri: [C:03+1] kubeadm: quote kubectl arguments [puppet] - 10https://gerrit.wikimedia.org/r/1277065 (https://phabricator.wikimedia.org/T420565) (owner: 10Filippo Giunchedi) [14:27:52] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11855738 (10ssingh) Hi @RobH. Any update on this from Dell's end? [14:28:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11855750 (10ssingh) @VRiley-WMF: We are planning to do this the week of May 11. Does that work for you? 
[14:29:21] !log updating debdeploy on bullseye to 0.0.99.15 [14:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:34] !log imported debdeploy 0.0.99.15 for bullseye-wikimedia (compat release for Cumin 6) [14:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:10] (03CR) 10Majavah: [C:03+1] kubeadm: quote kubectl arguments [puppet] - 10https://gerrit.wikimedia.org/r/1277065 (https://phabricator.wikimedia.org/T420565) (owner: 10Filippo Giunchedi) [14:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:33:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P91490 and previous config saved to /var/cache/conftool/dbconfig/20260424-143332-fceratto.json [14:39:38] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs2014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:40:56] (03PS12) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [14:40:56] (03PS12) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [14:43:11] (03CR) 10CI reject: [V:04-1] profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 
10Elukey) [14:43:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T419635)', diff saved to https://phabricator.wikimedia.org/P91491 and previous config saved to /var/cache/conftool/dbconfig/20260424-144340-fceratto.json [14:43:44] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:43:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance [14:44:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T419635)', diff saved to https://phabricator.wikimedia.org/P91492 and previous config saved to /var/cache/conftool/dbconfig/20260424-144405-fceratto.json [14:44:38] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs2010:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:47:22] (03PS1) 10Arnaudb: gitlab: silence SystemdUnitFailed alert after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1277115 (https://phabricator.wikimedia.org/T424175) [14:49:29] (03PS13) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [14:49:29] (03PS13) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [14:49:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:49:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.83% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:54:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:59:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T419635)', diff saved to https://phabricator.wikimedia.org/P91493 and previous config saved to /var/cache/conftool/dbconfig/20260424-145940-fceratto.json [14:59:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:59:47] (03PS3) 10AKhatun: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) [15:04:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:05:09] FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:07:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T410589)', diff saved to https://phabricator.wikimedia.org/P91494 and previous config saved to /var/cache/conftool/dbconfig/20260424-150738-ladsgroup.json [15:07:44] 
T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:08:52] (03PS1) 10Andrew Bogott: Partially revert "magnum: update capi worker build process" [puppet] - 10https://gerrit.wikimedia.org/r/1277118 [15:09:35] (03CR) 10Andrew Bogott: [C:03+2] Partially revert "magnum: update capi worker build process" [puppet] - 10https://gerrit.wikimedia.org/r/1277118 (owner: 10Andrew Bogott) [15:09:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:09:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P91495 and previous config saved to /var/cache/conftool/dbconfig/20260424-150949-fceratto.json [15:11:57] RECOVERY - MegaRAID on db1162 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:12:49] (03CR) 10Brouberol: [C:03+2] topic: mw-page-html-feature-counts-change-enrich and -next [puppet] - 10https://gerrit.wikimedia.org/r/1276794 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [15:17:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P91496 and previous config saved to /var/cache/conftool/dbconfig/20260424-151746-ladsgroup.json [15:18:05] (03CR) 10JavierMonton: [C:03+1] "It looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [15:19:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P91497 and previous config saved to /var/cache/conftool/dbconfig/20260424-151957-fceratto.json [15:24:29] (03PS1) 10Brouberol: dse-k8s-eqiad: add mw-page-html-feature-counts-change-enrich(-next) namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277119 [15:24:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:24:56] (03PS2) 10Brouberol: dse-k8s-eqiad: add mw-page-html-feature-counts-change-enrich(-next) namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277119 [15:24:56] (03PS4) 10Brouberol: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [15:27:31] (03CR) 10Herron: [C:03+1] profile::syslog::centralserver: Readd acme support [puppet] - 10https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [15:27:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P91498 and previous config saved to /var/cache/conftool/dbconfig/20260424-152755-ladsgroup.json [15:27:56] (03CR) 10AKhatun: [C:03+1] dse-k8s-eqiad: add mw-page-html-feature-counts-change-enrich(-next) namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277119 (owner: 10Brouberol) [15:30:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T419635)', diff saved to https://phabricator.wikimedia.org/P91499 and previous config saved to 
/var/cache/conftool/dbconfig/20260424-153005-fceratto.json [15:30:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:30:12] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [15:30:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T419635)', diff saved to https://phabricator.wikimedia.org/P91500 and previous config saved to /var/cache/conftool/dbconfig/20260424-153020-fceratto.json [15:32:22] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS trixie [15:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:34:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:35:12] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1010.eqiad.wmnet with OS trixie [15:35:39] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 529 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 762, active_shards: 1103, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 529, delayed_unassigned_shards: 0, [15:35:39] of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 67.5857843137255 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:41] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 518 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 791, active_shards: 1133, relocating_shards: 0, initializing_shards: 0, unassigned_shards [15:35:41] elayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.62507571168989 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:41] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 537 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 721, active_shards: 996, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5 [15:35:41] yed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 64.9706457925636 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:41] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 529 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 762, active_shards: 1103, relocating_shards: 0, initializing_shards: 0, unassigned_shards: [15:35:41] ayed_unassigned_shards: 0, 
number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 67.5857843137255 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:45] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 518 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 791, active_shards: 1133, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 518, delayed_unassigned_shards: [15:35:45] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.62507571168989 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 537 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 721, active_shards: 996, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 537, delayed_unassigned_shards: 0, [15:35:45] f_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 64.9706457925636 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 537 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 721, active_shards: 996, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 537, delayed_unassigned_shards: 0, [15:35:45] 
f_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 64.9706457925636 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:45] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 529 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 762, active_shards: 1103, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 529, delayed_unassigned_shards: 0, [15:35:46] of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 67.5857843137255 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:46] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 537 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 721, active_shards: 996, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 537, delayed_unassigned_shards: 0, [15:35:47] f_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 64.9706457925636 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:47] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 529 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 762, active_shards: 1103, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 529, delayed_unassigned_shards: 0, [15:35:48] 
of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 67.5857843137255 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:48] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 518 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 791, active_shards: 1133, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 518, delayed_unassigned_shards: [15:35:49] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.62507571168989 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:49] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 518 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 791, active_shards: 1133, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 518, delayed_unassigned_shards: [15:35:50] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.62507571168989 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:50] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 537 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 721, active_shards: 996, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5 [15:35:51] 
yed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 64.9706457925636 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:51] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 529 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 762, active_shards: 1103, relocating_shards: 0, initializing_shards: 0, unassigned_shards: [15:35:52] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 67.5857843137255 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:52] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 518 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 791, active_shards: 1133, relocating_shards: 0, initializing_shards: 0, unassigned_shards [15:35:53] elayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.62507571168989 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:37:45] FIRING: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors 
[15:38:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T410589)', diff saved to https://phabricator.wikimedia.org/P91501 and previous config saved to /var/cache/conftool/dbconfig/20260424-153802-ladsgroup.json [15:38:07] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:38:20] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance [15:38:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2195 (T410589)', diff saved to https://phabricator.wikimedia.org/P91502 and previous config saved to /var/cache/conftool/dbconfig/20260424-153827-ladsgroup.json [15:39:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:40:09] ^^ cloudelastic is expected, working on it now [15:40:26] (03PS1) 10Andrew Bogott: Further attempts to get setup_capi.sh.erb working properly [puppet] - 10https://gerrit.wikimedia.org/r/1277123 [15:41:45] (03CR) 10Andrew Bogott: [C:03+2] Further attempts to get setup_capi.sh.erb working properly [puppet] - 10https://gerrit.wikimedia.org/r/1277123 (owner: 10Andrew Bogott) [15:45:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T419635)', diff saved to https://phabricator.wikimedia.org/P91503 and previous config saved to /var/cache/conftool/dbconfig/20260424-154515-fceratto.json [15:45:19] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:49:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:50:14] (03CR) 10Brouberol: [C:03+2] 
dse-k8s-eqiad: add mw-page-html-feature-counts-change-enrich(-next) namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277119 (owner: 10Brouberol) [15:50:18] (03CR) 10Brouberol: [C:03+2] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [15:50:36] (03CR) 10Brouberol: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [15:51:18] (03CR) 10Brouberol: [C:03+2] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [15:54:34] (03PS4) 10Cwhite: profile::syslog::centralserver: Readd acme support [puppet] - 10https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [15:54:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:55:15] RESOLVED: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors [15:55:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P91504 and previous config saved to /var/cache/conftool/dbconfig/20260424-155523-fceratto.json [15:57:42] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: 
10Muehlenhoff) [15:57:54] (03Merged) 10jenkins-bot: dse-k8s-eqiad: add mw-page-html-feature-counts-change-enrich(-next) namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277119 (owner: 10Brouberol) [15:58:11] (03Merged) 10jenkins-bot: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [15:58:17] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 78067392 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:59:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:59:17] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2983360 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:59:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:00:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:01:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:05:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P91505 and previous config saved to /var/cache/conftool/dbconfig/20260424-160531-fceratto.json [16:07:17] (03PS5) 10Cwhite: profile::syslog::centralserver: Readd acme support [puppet] - 10https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [16:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:18] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:10] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:14:50] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:15:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T419635)', diff saved to https://phabricator.wikimedia.org/P91506 and previous config saved to /var/cache/conftool/dbconfig/20260424-161541-fceratto.json [16:15:48] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:15:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db1235.eqiad.wmnet with reason: Maintenance [16:16:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T419635)', diff saved to https://phabricator.wikimedia.org/P91507 and previous config saved to /var/cache/conftool/dbconfig/20260424-161607-fceratto.json [16:16:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:03] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [16:22:03] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [16:29:38] FIRING: [17x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:30:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:32:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T419635)', diff saved to https://phabricator.wikimedia.org/P91508 and previous config saved to 
/var/cache/conftool/dbconfig/20260424-163200-fceratto.json [16:32:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:34:18] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:30] (03CR) 10Cwhite: [C:03+2] profile::syslog::centralserver: Readd acme support [puppet] - 10https://gerrit.wikimedia.org/r/1276876 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [16:35:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:36:58] (03PS1) 10AKhatun: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277130 (https://phabricator.wikimedia.org/T424223) [16:39:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11856312 (10Jclark-ctr) [16:39:38] FIRING: [17x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:39:39] (03CR) 10AKhatun: [C:03+2] "Merging small change. Consistent with html_content_change app." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1277130 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [16:41:36] (03Merged) 10jenkins-bot: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277130 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [16:42:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P91509 and previous config saved to /var/cache/conftool/dbconfig/20260424-164209-fceratto.json [16:43:53] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11856315 (10Jclark-ctr) Drive rebuild has finished all errors have cleared [16:43:57] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11856316 (10Jclark-ctr) 05Open→03Resolved [16:44:25] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:44:31] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:44:40] (03PS1) 10Majavah: P:rsyslog::receiver: Fix certificates in rsyslog_receiver_remedy [puppet] - 10https://gerrit.wikimedia.org/r/1277132 [16:46:25] (03PS2) 10Majavah: P:rsyslog::receiver: Fix certificates in rsyslog_receiver_remedy [puppet] - 10https://gerrit.wikimedia.org/r/1277132 [16:47:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8470/co" [puppet] - 10https://gerrit.wikimedia.org/r/1277132 (owner: 10Majavah) [16:49:07] !log dancy@deploy1003 Installing scap version "4.251.0" for 2 host(s) [16:50:59] !log dancy@deploy1003 Installation of scap version "4.251.0" completed for 2 hosts [16:52:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P91510 and previous config saved to /var/cache/conftool/dbconfig/20260424-165217-fceratto.json [16:54:37] FIRING: [22x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:54:42] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:56:18] Same thing as yesterday, going to silence it [16:56:44] yeah, it's going to page every 24h [16:56:48] until it's resolved [16:56:59] mhm [16:57:27] going to silence until say Tuesday (including) [16:57:40] I think it'll page anyway [16:57:56] but I can be very wrong [16:58:01] one way to find out! [16:58:03] oh hm [16:58:05] yeah :D [16:59:37] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:00:53] !resolve [17:00:53] All incidents are already resolved. 
[17:02:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T419635)', diff saved to https://phabricator.wikimedia.org/P91511 and previous config saved to /var/cache/conftool/dbconfig/20260424-170225-fceratto.json [17:02:30] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:02:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [17:02:44] Raine: hm, sirenbot only grabs incidents less than 24 hours old for those commands, I wonder if we should increase that [17:03:14] (but in the meantime you can still resolve it, you'll just have to do it in the browser) [17:03:38] rzl: right, good to know, thank you! [17:04:08] I don't think we need to increase it, using the app or web for these corner cases is fine, though it is true that I wasn't aware [17:05:03] I was just thinking about it because the "hey did you forget about this??? it's been 24 hours so I'm going to page you again" behavior from VO is already so confusing -- having !incidents also *not show it* might not help [17:05:21] true [17:06:15] maybe we show anything that's younger than 24h *and* anything that's younger than, I dunno, a week? and still not resolved [17:06:37] (there still needs to be some kind of upper bound just for performance reasons, we don't want to download the whole alert history from the API) [17:07:03] out of curiosity, if anyone has the ticket handy, I'd be grateful. Just want to keep track of the progress [17:15:47] Amir1: https://phabricator.wikimedia.org/T420993 [17:16:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [17:16:31] thanks! 
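Note on the 17:06 suggestion above (show anything younger than 24h, plus anything younger than roughly a week that is still unresolved): a minimal sketch of that filter follows. It is not sirenbot's actual code, which does not appear in this log. The `Incident` record, its field names, the `visible_incidents` helper, and the sample incident names are all hypothetical, and the seven-day bound is only the "I dunno, a week" figure from the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(hours=24)    # the 24-hour cutoff described in the log
UNRESOLVED_WINDOW = timedelta(days=7)  # assumed upper bound ("I dunno, a week")


@dataclass
class Incident:
    # Hypothetical record; the real incident fields are not shown in the log.
    name: str
    started_at: datetime
    resolved: bool


def visible_incidents(incidents, now):
    """Keep incidents younger than 24h, plus unresolved ones younger than a week."""
    shown = []
    for inc in incidents:
        age = now - inc.started_at
        if age < RECENT_WINDOW or (not inc.resolved and age < UNRESOLVED_WINDOW):
            shown.append(inc)
    return shown


# Illustrative data only; none of these incidents come from the log.
now = datetime(2026, 4, 24, 17, 6, tzinfo=timezone.utc)
sample = [
    Incident("fresh-resolved", now - timedelta(hours=2), True),
    Incident("old-unresolved", now - timedelta(hours=30), False),
    Incident("old-resolved", now - timedelta(hours=30), True),
    Incident("ancient-unresolved", now - timedelta(days=10), False),
]

for inc in visible_incidents(sample, now):
    print(inc.name)
```

The week-long bound is what keeps the lookup finite, matching the performance concern raised at 17:06:37. A real bot would presumably pass that bound to the alerting API as a query window instead of filtering client-side, but that is an assumption.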
[17:26:21] (03PS1) 10Jasmine: service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1277148 (https://phabricator.wikimedia.org/T418748) [17:28:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11856464 (10VRiley-WMF) [17:29:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1251.eqiad.wmnet with reason: Maintenance [17:29:47] (03CR) 10Ottomata: stream: mw-page-html-feature-counts-change-enrich (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: 10AKhatun) [17:29:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T419635)', diff saved to https://phabricator.wikimedia.org/P91512 and previous config saved to /var/cache/conftool/dbconfig/20260424-172952-fceratto.json [17:29:56] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:33:12] (03PS1) 10Bking: wdqs-internal-scholarly: Provision certificates with new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277153 (https://phabricator.wikimedia.org/T420993) [17:34:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277153 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [17:36:19] (03CR) 10Bking: [C:03+2] wdqs-internal-scholarly: Provision certificates with new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277153 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [17:37:43] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11856509 (10MisterSynergy) Any update here? The problem persists, my bots are regularly crashing due to persistent maxlag timeouts, and community members complain that... 
[17:38:53] (03PS3) 10Cwhite: P:rsyslog::receiver: Fix certificates in rsyslog_receiver_remedy [puppet] - 10https://gerrit.wikimedia.org/r/1277132 (https://phabricator.wikimedia.org/T424204) (owner: 10Majavah) [17:41:12] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/output/1277132/8471/" [puppet] - 10https://gerrit.wikimedia.org/r/1277132 (https://phabricator.wikimedia.org/T424204) (owner: 10Majavah) [17:44:18] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:45:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11856540 (10VRiley-WMF) @ssingh Yes, that works for me. I will plan for it then. Thanks! [17:45:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11856542 (10VRiley-WMF) p:05Triage→03Medium [17:46:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T419635)', diff saved to https://phabricator.wikimedia.org/P91513 and previous config saved to /var/cache/conftool/dbconfig/20260424-174641-fceratto.json [17:46:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:47:18] (03CR) 10Cwhite: [C:03+2] P:rsyslog::receiver: Fix certificates in rsyslog_receiver_remedy [puppet] - 10https://gerrit.wikimedia.org/r/1277132 (https://phabricator.wikimedia.org/T424204) (owner: 10Majavah) [17:49:13] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11856560 (10RobH) I reached back out to them yesterday and I'm awaiting a reply. They were bugging us about the invoice for the mainboard they failed to install. 
[17:49:18] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:49:38] FIRING: [2x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:55:23] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856576 (10Jclark-ctr) a:03Jclark-ctr [17:56:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11856579 (10ssingh) Slight correction on my end, sorry: this host is not an active host, so you can install the NIC whenever you want before May 11 as well. We will however pick t... 
[17:56:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:43] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [17:56:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P91515 and previous config saved to /var/cache/conftool/dbconfig/20260424-175649-fceratto.json [18:01:07] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb1016 to eqiad - jclark@cumin1003" [18:01:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb1016 to eqiad - jclark@cumin1003" [18:01:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:01:21] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:01:36] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:06:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P91516 and previous config saved to /var/cache/conftool/dbconfig/20260424-180657-fceratto.json [18:08:31] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:09:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:09:55] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:11:47] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1005.eqiad.wmnet with OS trixie [18:11:54] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1006.eqiad.wmnet with OS trixie [18:11:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11856646 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-ctrl1005.eqiad.wmne... [18:12:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11856649 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-ctrl1006.eqiad.wmne... 
[18:17:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T419635)', diff saved to https://phabricator.wikimedia.org/P91517 and previous config saved to /var/cache/conftool/dbconfig/20260424-181705-fceratto.json [18:17:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:17:12] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:23:42] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1005.eqiad.wmnet with reason: host reimage [18:23:51] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1006.eqiad.wmnet with reason: host reimage [18:29:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1005.eqiad.wmnet with reason: host reimage [18:34:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1006.eqiad.wmnet with reason: host reimage [18:34:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:39:38] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:45:43] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:48:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:48:23] !log jclark@cumin1003 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1005.eqiad.wmnet with OS trixie [18:48:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11856759 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-ctrl1005.eqiad.wmnet wi... [18:49:38] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:49:43] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:50:08] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:50:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1006.eqiad.wmnet with OS trixie [18:50:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11856770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-ctrl1006.eqiad.wmnet wi... 
[18:50:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11856774 (10Jclark-ctr) 05Open→03Resolved [19:02:07] (03PS1) 10Eevans: linked-artifacts: upgrade to hoarde v1.1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277172 (https://phabricator.wikimedia.org/T423168) [19:04:31] (03CR) 10Eevans: [C:03+2] linked-artifacts: upgrade to hoarde v1.1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277172 (https://phabricator.wikimedia.org/T423168) (owner: 10Eevans) [19:06:34] (03Merged) 10jenkins-bot: linked-artifacts: upgrade to hoarde v1.1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277172 (https://phabricator.wikimedia.org/T423168) (owner: 10Eevans) [19:06:54] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [19:09:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:10:38] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding druid-internal1001 to eqiad - jclark@cumin1003" [19:10:43] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding druid-internal1001 to eqiad - jclark@cumin1003" [19:10:43] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:11:59] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:12:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:12:17] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host 
druid-internal1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:12:27] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:12:31] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:12:33] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:13:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:59] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:14:58] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:15:39] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:15:45] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:16:21] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:16:35] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:16:51] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:17:55] !log 
eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [19:18:19] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [19:18:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid-internal1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:20:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid-internal1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:20:19] (03PS14) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [19:20:19] (03PS14) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [19:20:19] (03PS1) 10Elukey: cfssl::cert: add require for csr when swapping intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277175 (https://phabricator.wikimedia.org/T420993) [19:20:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid-internal1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:20:40] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:21:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid-internal1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:22:39] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
druid-internal1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:24:31] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:24:54] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:25:25] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:27:19] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host druid-internal1001.eqiad.wmnet with OS trixie [19:27:26] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host druid-internal1002.eqiad.wmnet with OS trixie [19:27:28] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host druid-internal1003.eqiad.wmnet with OS trixie [19:27:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856895 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host druid-internal1001.eqiad.wmnet with OS trixie [19:27:34] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856896 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host druid-internal1002.eqiad.wmnet with OS trixie [19:27:37] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856897 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host druid-internal1003.eqiad.wmnet with OS trixie [19:27:56] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host druid-internal1004.eqiad.wmnet with OS trixie [19:27:57] (03PS1) 10Eevans: Revert "linked-artifacts: upgrade to hoarde 
v1.1.3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277178 [19:28:08] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host druid-internal1004.eqiad.wmnet with OS trixie [19:29:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:32:12] (03CR) 10Eevans: [C:03+2] Revert "linked-artifacts: upgrade to hoarde v1.1.3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277178 (owner: 10Eevans) [19:32:13] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:33:42] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:34:16] (03Merged) 10jenkins-bot: Revert "linked-artifacts: upgrade to hoarde v1.1.3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277178 (owner: 10Eevans) [19:35:49] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:36:06] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:37:03] !log eevans@deploy1003 helmfile [staging] 
START helmfile.d/services/linked-artifacts: apply [19:37:23] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [19:37:53] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host druid-internal1006.eqiad.wmnet with OS trixie [19:38:03] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host druid-internal1006.eqiad.wmnet with OS trixie [19:38:49] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid-internal1003.eqiad.wmnet with reason: host reimage [19:38:50] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid-internal1001.eqiad.wmnet with reason: host reimage [19:38:52] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid-internal1004.eqiad.wmnet with reason: host reimage [19:38:54] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid-internal1002.eqiad.wmnet with reason: host reimage [19:40:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:40:49] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:41:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:42:01] (03PS1) 10Bking: opensearch 1: parameterize disk threshold values [puppet] - 10https://gerrit.wikimedia.org/r/1277180 [19:42:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277180 (owner: 10Bking) [19:44:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - 
https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:45:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid-internal1003.eqiad.wmnet with reason: host reimage [19:45:13] (03PS1) 10Eevans: Revert^2 "linked-artifacts: upgrade to hoarde v1.1.3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277181 [19:47:37] (03CR) 10Eevans: [C:03+2] Revert^2 "linked-artifacts: upgrade to hoarde v1.1.3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277181 (owner: 10Eevans) [19:49:08] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid-internal1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:49:17] (03PS2) 10Bking: opensearch 1: parameterize disk threshold values [puppet] - 10https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) [19:49:24] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid-internal1006.eqiad.wmnet with reason: host reimage [19:49:40] (03Merged) 10jenkins-bot: Revert^2 "linked-artifacts: upgrade to hoarde v1.1.3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277181 (owner: 10Eevans) [19:50:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid-internal1001.eqiad.wmnet with reason: host reimage [19:50:45] FIRING: [2x] ProbeDown: Service wdqs2017:443 has failed probes (http_wdqs_internal_scholarly_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2017:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:51:02] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host druid-internal1005.eqiad.wmnet with OS trixie [19:51:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - 
https://phabricator.wikimedia.org/T417430#11856948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host druid-internal1005.eqiad.wmnet with OS trixie [19:51:29] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [19:51:40] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [19:51:53] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [19:52:05] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [19:53:59] (03PS3) 10Bking: opensearch 1: parameterize disk threshold values and raise limits in cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) [19:54:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid-internal1006.eqiad.wmnet with reason: host reimage [19:54:34] (03CR) 10CI reject: [V:04-1] opensearch 1: parameterize disk threshold values and raise limits in cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:55:35] (03PS4) 10Bking: opensearch 1: parameterize disk threshold values/up limits in ce [puppet] - 10https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) [19:56:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:59:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid-internal1004.eqiad.wmnet with reason: host reimage [19:59:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:00:35] 10ops-eqiad, 06SRE, 06DC-Ops: 
Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856980 (10Jclark-ctr) [20:01:20] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:03:07] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid-internal1005.eqiad.wmnet with reason: host reimage [20:04:26] jclark@cumin1003 reimage (PID 686083) is awaiting input [20:04:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:04:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid-internal1003.eqiad.wmnet with OS trixie [20:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:04:50] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11856995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host druid-internal1003.eqiad.wmnet with OS trixie completed: - druid-internal1003 (**PAS... 
[20:04:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid-internal1002.eqiad.wmnet with reason: host reimage
[20:06:56] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:07:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:07:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid-internal1001.eqiad.wmnet with OS trixie
[20:07:26] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11857016 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host druid-internal1001.eqiad.wmnet with OS trixie completed: - druid-internal1001 (**PAS...
[20:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:09:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid-internal1005.eqiad.wmnet with reason: host reimage
[20:09:32] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:12:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:12:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid-internal1006.eqiad.wmnet with OS trixie
[20:12:42] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11857041 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host druid-internal1006.eqiad.wmnet with OS trixie completed: - druid-internal1006 (**PAS...
[20:14:09] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:15:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:15:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid-internal1004.eqiad.wmnet with OS trixie
[20:15:47] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11857056 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host druid-internal1004.eqiad.wmnet with OS trixie completed: - druid-internal1004 (**PAS...
[20:19:43] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:22:49] jclark@cumin1003 reimage (PID 686055) is awaiting input
[20:23:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:23:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid-internal1002.eqiad.wmnet with OS trixie
[20:23:14] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11857094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host druid-internal1002.eqiad.wmnet with OS trixie completed: - druid-internal1002 (**PAS...
[20:26:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:26:57] (PS5) Ryan Kemper: opensearch 1: parameterize disk threshold values/up limits in ce [puppet] - https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[20:27:27] (CR) Ryan Kemper: [C:+1] opensearch 1: parameterize disk threshold values/up limits in ce [puppet] - https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[20:27:30] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:27:45] (CR) Ryan Kemper: [C:+1] "My PS5 just added a missing newline; effectively a NO-OP" [puppet] - https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[20:28:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:28:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid-internal1005.eqiad.wmnet with OS trixie
[20:28:08] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11857098 (Jclark-ctr)
[20:28:11] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11857099 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host druid-internal1005.eqiad.wmnet with OS trixie completed: - druid-internal1005 (**PAS...
[20:28:23] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11857112 (10Jclark-ctr) 05Open→03Resolved [20:32:15] RESOLVED: [2x] ProbeDown: Service wdqs2017:443 has failed probes (http_wdqs_internal_scholarly_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2017:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:29] (03CR) 10Bking: [C:03+2] opensearch 1: parameterize disk threshold values/up limits in ce [puppet] - 10https://gerrit.wikimedia.org/r/1277180 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [20:54:18] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:59:18] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:00:42] 10ops-eqiad, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1359:9290 - https://phabricator.wikimedia.org/T424396 (10phaultfinder) 03NEW [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:33:39] (03PS1) 10Ryan Kemper: cloudelastic: add missing extra-analysis plugins [puppet] - 10https://gerrit.wikimedia.org/r/1277194 (https://phabricator.wikimedia.org/T422860) [21:34:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:35:45] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277194 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:36:53] (03PS1) 10Dduvall: zuul: Parameterize web root and host [puppet] - 10https://gerrit.wikimedia.org/r/1277195 [21:38:02] (03CR) 10Bking: [C:03+2] cloudelastic: add missing extra-analysis plugins [puppet] - 10https://gerrit.wikimedia.org/r/1277194 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:39:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:42:09] (03CR) 10RLazarus: [C:03+1] wmnet: add new CNAMEs for wikifunctions evaluators [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [21:44:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:52:46] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1424, relocating_shards: 0, 
initializing_shards: 6, unassigned_shards: 221, delayed_unassigned_shards: 0, number_of_pendin [21:52:46] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.25075711689885 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:46] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 816, active_shards: 1388, relocating_shards: 1, initializing_shards: 9, unassigned_shards: 235, delayed_unassigned_shards: 0, number_of_pending_ta [21:52:46] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 136138, active_shards_percent_as_number: 85.04901960784314 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:46] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 816, active_shards: 1388, relocating_shards: 1, initializing_shards: 9, unassigned_shards: 235, delayed_unassigned_shards: 0, number_of_pending_ta [21:52:46] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 136142, active_shards_percent_as_number: 85.04901960784314 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:46] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1424, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 221, 
delayed_unassigned_shards: 0, number_of_pendin [21:52:47] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.25075711689885 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:47] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1424, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 221, delayed_unassigned_shards: 0, number_of_pendin [21:52:48] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.25075711689885 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:48] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1388, relocating_shards: 1, initializing_shards: 9, unassigned_shards: 235, delayed_unassign [21:52:49] s: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 136768, active_shards_percent_as_number: 85.04901960784314 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:49] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1424, relocating_shards: 0, initializing_shards: 6, 
unassigned_shards: 221, delayed_unas [21:52:50] hards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.25075711689885 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:50] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1428, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 218, delayed_unas [21:52:51] hards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 31, active_shards_percent_as_number: 86.49303452453059 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:51] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1390, relocating_shards: 1, initializing_shards: 8, unassigned_shards: 234, delayed_unassign [21:52:52] s: 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 140785, active_shards_percent_as_number: 85.17156862745098 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:53:44] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 816, active_shards: 1423, relocating_shards: 1, 
initializing_shards: 7, unassigned_shards: 202, delayed_unassigned_shards: 0, number_of_pending_ta [21:53:44] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 193513, active_shards_percent_as_number: 87.19362745098039 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:54:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:59:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:00:43] (03PS1) 10Dduvall: zuul: Use master branch of integration/config [puppet] - 10https://gerrit.wikimedia.org/r/1277198 [22:02:50] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1304, relocating_shards: 6, initializing_shards: 40, unassigned_shards: 189, delayed_unassig [22:02:50] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.06196999347684 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:03:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 766, active_shards: 1311, relocating_shards: 4, initializing_shards: 40, unassigned_shards: 182, delayed_unassigned_shards: 0, 
number_of_pending_t [22:03:46] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.51859099804305 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:03:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 766, active_shards: 1311, relocating_shards: 4, initializing_shards: 40, unassigned_shards: 182, delayed_unassigned_shards: 0, number_of_pending_t [22:03:46] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.51859099804305 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:03:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 766, active_shards: 1311, relocating_shards: 4, initializing_shards: 40, unassigned_shards: 182, delayed_unassigned_shards: 0, number_of_pending_t [22:03:46] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.51859099804305 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:03:48] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1311, relocating_shards: 4, initializing_shards: 40, unassigned_shards: 182, delayed_unassig [22:03:48] ds: 0, number_of_pending_tasks: 0, 
number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.51859099804305 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:04:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:34:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:59:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:40:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277219 [23:40:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277219 (owner: 10TrainBranchBot) [23:51:07] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277219 (owner: 10TrainBranchBot) [23:54:38] FIRING: [17x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired