[00:13:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:14:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:09:54] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276453
[01:09:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276453 (owner: TrainBranchBot)
[01:20:06] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276453 (owner: TrainBranchBot)
[01:52:45] FIRING: ProbeDown: Service etherpad1004:9001 has failed probes (http_etherpad_nodejs_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1004:9001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:57:40] RESOLVED: ProbeDown: Service etherpad1004:9001 has failed probes (http_etherpad_nodejs_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1004:9001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:00:53] !log mwpresync@deploy1003 Started scap
build-images: Publishing wmf/next image
[02:07:06] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 12s)
[02:09:18] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:18] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:24] FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:44:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T410589)', diff saved to https://phabricator.wikimedia.org/P91317 and previous config saved to /var/cache/conftool/dbconfig/20260423-024447-ladsgroup.json
[02:44:51] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:54:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91318 and previous config saved to /var/cache/conftool/dbconfig/20260423-025455-ladsgroup.json
[03:05:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91319 and previous config saved to /var/cache/conftool/dbconfig/20260423-030504-ladsgroup.json
[03:15:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T410589)', diff saved to https://phabricator.wikimedia.org/P91320 and
previous config saved to /var/cache/conftool/dbconfig/20260423-031512-ladsgroup.json
[03:15:17] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[03:15:30] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[03:15:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T410589)', diff saved to https://phabricator.wikimedia.org/P91321 and previous config saved to /var/cache/conftool/dbconfig/20260423-031538-ladsgroup.json
[03:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:14:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:27:21] (PS1) Marostegui: mariadb: Productionize pc2022 [puppet] - https://gerrit.wikimedia.org/r/1276474 (https://phabricator.wikimedia.org/T418973)
[05:27:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2012,2022].codfw.wmnet with reason: Cloning
[05:28:23] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2012: Cloning pc2022 from pc2012
[05:28:23] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:28:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:28:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2012: Cloning pc2022 from pc2012
[05:28:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2012,2022].codfw.wmnet,pc1012.eqiad.wmnet with reason: Cloning
[05:28:59] (CR) Marostegui: [C:+2] mariadb: Productionize pc2022 [puppet] - https://gerrit.wikimedia.org/r/1276474 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[05:37:02] (PS1) Marostegui: pc2012: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276477
[05:37:30] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1276477 (owner: Marostegui)
[05:37:41] (CR) Marostegui: [C:+2] pc2012: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276477 (owner: Marostegui)
[05:40:52] (PS1) Marostegui: mariadb: Productionize db2252 [puppet] - https://gerrit.wikimedia.org/r/1276478 (https://phabricator.wikimedia.org/T418979)
[05:41:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2143: Cloning db2252 from db2143
[05:41:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:41:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:41:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2143: Cloning db2252 from db2143
[05:41:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2143,2252].codfw.wmnet,db1153.eqiad.wmnet with reason: Cloning
[05:42:43] (CR) Marostegui: [C:+2] mariadb: Productionize db2252 [puppet] - https://gerrit.wikimedia.org/r/1276478 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[05:43:37] (PS1) Marostegui: db2143: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276479
[05:44:13] (CR) Marostegui: [C:+2] db2143: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276479 (owner: Marostegui)
[05:57:30] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 0:35:00 on gerrit2003.wikimedia.org with reason: Gerrit maintenance
[05:57:51] !log jelto@cumin1003 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:35:00 on gerrit.discovery.wmnet with reason: Gerrit maintenance
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0600)
[06:00:04] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0600).
[06:00:04] jelto: Time to do the Gerrit maintenance - T333143 deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0600).
[06:00:05] T333143: Move Gerrit data out of root partition - https://phabricator.wikimedia.org/T333143
[06:00:20] (CR) Jelto: [C:+2] gerrit: migrate gerrit2003 data to /srv/gerrit [puppet] - https://gerrit.wikimedia.org/r/1273683 (https://phabricator.wikimedia.org/T333143) (owner: Jelto)
[06:04:59] !log start gerrit2003 maintenance - T333143
[06:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:46] FIRING: [7x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable
[06:10:46] RESOLVED: [7x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 -
https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable
[06:11:42] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1275942 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:12:09] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1275943 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:14:51] (CR) Jelto: [V:+1 C:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8453/co" [puppet] - https://gerrit.wikimedia.org/r/1275942 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:24:06] (PS1) Marostegui: db2143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276508 (https://phabricator.wikimedia.org/T424171)
[06:25:04] (CR) Marostegui: [C:+2] db2143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276508 (https://phabricator.wikimedia.org/T424171) (owner: Marostegui)
[06:28:05] !log gerrit2003 maintenance finished - T333143
[06:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:09] T333143: Move Gerrit data out of root partition - https://phabricator.wikimedia.org/T333143
[06:28:15] (PS1) Marostegui: db2252: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1276509 (https://phabricator.wikimedia.org/T418979)
[06:29:33] (PS1) Marostegui: instances.yaml: Replace db2143 with db2252 [puppet] - https://gerrit.wikimedia.org/r/1276510 (https://phabricator.wikimedia.org/T418979)
[06:30:44] (PS7) Jelto: gerrit: migrate gerrit_site away from root partition [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[06:32:35] (CR) Jelto: [V:+1 C:-1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[06:38:24] FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:48:50] (CR) Marostegui: [C:+2] db2252: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1276509 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[06:49:22] (CR) Marostegui: [C:+2] instances.yaml: Replace db2143 with db2252 [puppet] - https://gerrit.wikimedia.org/r/1276510 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[06:52:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2143 from ms3, add db2252 T418979', diff saved to https://phabricator.wikimedia.org/P91326 and previous config saved to /var/cache/conftool/dbconfig/20260423-065214-marostegui.json
[06:52:19] T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979
[06:53:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Make db2252 master of ms3 T418979', diff saved to https://phabricator.wikimedia.org/P91327 and previous config saved to /var/cache/conftool/dbconfig/20260423-065323-marostegui.json
[06:56:27] (PS1) Marostegui: db2251: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276512
[06:56:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2252: Cloning
[06:56:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:56:57] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[06:56:57] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2252: Cloning
[06:58:04] !log marostegui@cumin1003 dbctl commit (dc=all):
'Repool ms1 with db2252 as new codfw master T418979', diff saved to https://phabricator.wikimedia.org/P91328 and previous config saved to /var/cache/conftool/dbconfig/20260423-065803-marostegui.json
[06:58:08] T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979
[06:58:54] (CR) Marostegui: [C:+2] db2251: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276512 (owner: Marostegui)
[07:00:04] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:07:43] (CR) Jelto: [C:+2] helmfile.d/miscweb: add values file for aux private secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1275934 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[07:10:25] (Merged) jenkins-bot: helmfile.d/miscweb: add values file for aux private secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1275934 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[07:11:42] (PS1) Marostegui: instances.yaml: Remove db2145 [puppet] - https://gerrit.wikimedia.org/r/1276514 (https://phabricator.wikimedia.org/T424177)
[07:11:43] (PS3) Jcrespo: mariadb: Set db2250 as a new codfw s1 backup source [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979)
[07:13:28] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[07:13:30] (CR) Marostegui: [C:+1] mariadb: Set db2250 as a new codfw s1 backup source [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[07:13:37] (PS1) Muehlenhoff: Add doh5003/5004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1276516
(https://phabricator.wikimedia.org/T421863)
[07:14:02] (CR) Marostegui: [C:+2] instances.yaml: Remove db2145 [puppet] - https://gerrit.wikimedia.org/r/1276514 (https://phabricator.wikimedia.org/T424177) (owner: Marostegui)
[07:14:44] SRE, Ganeti, Infrastructure-Foundations: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11850023 (MoritzMuehlenhoff) The prometheus5003 VM is ready
[07:14:53] (CR) Jcrespo: [C:+2] mariadb: Set db2250 as a new codfw s1 backup source [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[07:15:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2145 from dbctl T424177', diff saved to https://phabricator.wikimedia.org/P91329 and previous config saved to /var/cache/conftool/dbconfig/20260423-071500-marostegui.json
[07:15:05] T424177: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177
[07:16:54] (PS2) Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779)
[07:20:08] (PS2) Giuseppe Lavagetto: cache_misc: apply traffic classification [puppet] - https://gerrit.wikimedia.org/r/1276403
[07:22:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye
[07:23:01] SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850030 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2007.codfw.wmnet with OS bullseye
[07:24:20] (CR) Muehlenhoff: [C:+2] Apply the tcp-proxy role to tcp-proxy5003/5004 [puppet] - https://gerrit.wikimedia.org/r/1275942 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:30:57] (CR) Ayounsi: [C:+1] Add doh5003/5004 to
site.pp [puppet] - https://gerrit.wikimedia.org/r/1276516 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:32:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:33:34] (PS1) Marostegui: db2145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276520 (https://phabricator.wikimedia.org/T424177)
[07:34:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:34:18] (CR) Marostegui: [C:+2] db2145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276520 (https://phabricator.wikimedia.org/T424177) (owner: Marostegui)
[07:37:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:40:19] SRE, Infrastructure-Foundations, Patch-For-Review: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#11850086 (MoritzMuehlenhoff)
[07:41:01] (CR) Muehlenhoff: [C:+2] Add doh5003/5004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1276516 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:41:29] (CR) Muehlenhoff: [C:+2] Fix Cumin alias for kerberized SSH access [puppet] -
https://gerrit.wikimedia.org/r/1275883 (owner: Muehlenhoff)
[07:45:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
[07:48:22] (CR) Muehlenhoff: [C:+2] firewall::service: Add a new parameter unrestricted_access [puppet] - https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[07:50:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
[07:52:47] (PS1) Marostegui: installserver: Remove db2252 [puppet] - https://gerrit.wikimedia.org/r/1276525
[07:54:33] (PS1) Muehlenhoff: http-sso-django-login: Switch to firewall::service and restrict access [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804)
[07:54:52] (CR) Marostegui: [C:+2] installserver: Remove db2252 [puppet] - https://gerrit.wikimedia.org/r/1276525 (owner: Marostegui)
[07:55:02] (CR) CI reject: [V:-1] http-sso-django-login: Switch to firewall::service and restrict access [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[07:57:12] (PS1) Marostegui: installserver: Add pc20[21-24] [puppet] - https://gerrit.wikimedia.org/r/1276527 (https://phabricator.wikimedia.org/T418973)
[07:58:02] (PS2) Muehlenhoff: http-sso-django-login: Switch to firewall::service and restrict access [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804)
[07:58:23] (PS2) Marostegui: installserver: Add pc10[21-24] [puppet] - https://gerrit.wikimedia.org/r/1276527 (https://phabricator.wikimedia.org/T418973)
[08:01:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh5003.wikimedia.org
[08:01:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:01:35] (CR) Marostegui: [C:+2]
installserver: Add pc10[21-24] [puppet] - https://gerrit.wikimedia.org/r/1276527 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[08:04:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5003.wikimedia.org - jmm@cumin2002"
[08:05:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5003.wikimedia.org - jmm@cumin2002"
[08:05:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:05:02] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh5003.wikimedia.org on all recursors
[08:05:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh5003.wikimedia.org on all recursors
[08:05:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5003.wikimedia.org - jmm@cumin2002"
[08:05:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5003.wikimedia.org - jmm@cumin2002"
[08:06:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh5003.wikimedia.org with OS bookworm
[08:07:02] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host doh5003.wikimedia.org with OS bookworm
[08:07:08] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11850175 (elukey) Update: I tested the new SM firmwares for BIOS and BMC, but the latter seems leading to an inconsistent state: the update
doesn't start because of a weird issu...
[08:07:17] (CR) Majavah: http-sso-django-login: Switch to firewall::service and restrict access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[08:09:31] (CR) Jelto: [V:+1 C:+2] gerrit: migrate gerrit_site away from root partition [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[08:10:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2007.codfw.wmnet with OS bullseye
[08:10:09] SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850190 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2007.codfw.wmnet with OS bullseye completed...
[08:10:49] (PS1) Majavah: P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356)
[08:10:59] (PS8) Arnaudb: envoyproxy: rebuild envoy.yaml when the placeholder is created [puppet] - https://gerrit.wikimedia.org/r/1275827 (https://phabricator.wikimedia.org/T421827)
[08:12:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[08:12:53] (CR) CI reject: [V:-1] P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[08:13:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[08:14:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:15] (PS2) Majavah: P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356)
[08:16:03] (CR) Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) (owner: Giuseppe Lavagetto)
[08:17:57] SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850222 (MatthewVernon)
[08:18:08] (CR) Majavah: [C:+2] P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[08:25:44] (CR) Elukey: "True, but lookup() outside profiles have some sense only to lookup very generic variables that are supposed to be everywhere, and/or globa" [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:28:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[08:28:53] (CR) Filippo Giunchedi: "LGTM overall, adding Moritz too" [puppet] - https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: Andrew Bogott)
[08:29:01] (PS7) Elukey: admin_ng: move staging clusters to the pki discovery2026
intermediate [deployment-charts] - https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993)
[08:29:30] (CR) JavierMonton: [C:+1] EventStreamConfig - add rc0 streams for html and feature count change [mediawiki-config] - https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: Ottomata)
[08:30:46] (CR) Mpostoronca: "Could you tell us how to test this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[08:31:05] (PS8) Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993)
[08:31:35] (CR) Elukey: "All right I think I got it, lemme know if now it makes sense!" [deployment-charts] - https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:32:02] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11850257 (ayounsi) Those hosts are 7 years old, shouldn't they be fully decom ? FYI, previous data are visible in https://netbox.wikime...
[08:32:07] (CR) Mpostoronca: "Is there some link to the documentation ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[08:34:18] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:35:37] (CR) Muehlenhoff: "I had been wondering about the same. It's also really unclear how the difference between the "nochange" and the standard repo actually?
If" [puppet] - https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: Andrew Bogott)
[08:36:32] (PS2) Muehlenhoff: Add tcp-proxy5003/5004 to conftool [puppet] - https://gerrit.wikimedia.org/r/1275943 (https://phabricator.wikimedia.org/T421863)
[08:36:37] (CR) Elukey: ganeti: Move pki::get_cert into the profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff)
[08:37:36] (PS4) Jcrespo: mariadb: Pool db2250 for backups instead of db2141 [puppet] - https://gerrit.wikimedia.org/r/1276406 (https://phabricator.wikimedia.org/T418979)
[08:38:22] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276406 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[08:39:41] (CR) Kosta Harlan: [C:+1] hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[08:40:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2037.codfw.wmnet with reason: Maintenance
[08:40:29] (CR) Kosta Harlan: [C:+1] "https://docs.hcaptcha.com/enterprise/secure_enclave#allowpopups-parameter" [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[08:40:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91330 and previous config saved to /var/cache/conftool/dbconfig/20260423-084035-fceratto.json
[08:42:15] (CR) Muehlenhoff: [C:+2] Add tcp-proxy5003/5004 to conftool [puppet] - https://gerrit.wikimedia.org/r/1275943 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[08:42:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and
Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:47:13] (CR) Muehlenhoff: ganeti: Move pki::get_cert into the profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff)
[08:47:24] (CR) Filippo Giunchedi: "I couldn't find any explicit documentation, though from looking at the packages in nochange my understanding is that they are required dep" [puppet] - https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: Andrew Bogott)
[08:50:06] (CR) Muehlenhoff: "True, but the escapsulation has already been broken by the cases listed above :-)" [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:52:07] (PS2) Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs [puppet] - https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992)
[08:52:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91333 and previous config saved to /var/cache/conftool/dbconfig/20260423-085207-fceratto.json
[08:52:50] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=tcp-proxy5003.eqsin.wmnet
[08:53:07] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=tcp-proxy5003.eqsin.wmnet
[08:53:17] (PS1) Marostegui: instances.yaml: Remove db2146 [puppet] - https://gerrit.wikimedia.org/r/1276612 (https://phabricator.wikimedia.org/T418979)
[08:55:36] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5003.wikimedia.org with reason: host reimage
[08:55:47] (CR) Marostegui: [C:+2] instances.yaml: Remove db2146 [puppet] - https://gerrit.wikimedia.org/r/1276612
(https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [08:56:46] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=tcp-proxy5004.eqsin.wmnet [08:56:48] (03CR) 10Jcrespo: [C:03+2] mariadb: Pool db2250 for backups instead of db2141 [puppet] - 10https://gerrit.wikimedia.org/r/1276406 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [08:56:51] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=tcp-proxy5004.eqsin.wmnet [08:58:11] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=tcp-proxy5001.eqsin.wmnet [08:58:15] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=tcp-proxy5002.eqsin.wmnet [08:59:49] (03CR) 10Daniel Kinzler: [C:03+1] cache::haproxy: support wikilink style usernames in UAs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) (owner: 10Giuseppe Lavagetto) [09:00:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2146 from dbctl T424179', diff saved to https://phabricator.wikimedia.org/P91334 and previous config saved to /var/cache/conftool/dbconfig/20260423-090014-marostegui.json [09:00:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5003.wikimedia.org with reason: host reimage [09:00:19] T424179: Add an edit tag when someone edits another user's user CSS - https://phabricator.wikimedia.org/T424179 [09:02:16] (03PS3) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) [09:02:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037', diff saved to https://phabricator.wikimedia.org/P91335 and previous config saved to /var/cache/conftool/dbconfig/20260423-090216-fceratto.json [09:06:52] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:07:33] (03CR) 10Elukey: [C:03+1] ganeti: Move pki::get_cert into the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [09:07:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey) [09:07:52] (03PS3) 10Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) [09:08:13] (03PS1) 10AikoChou: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 (https://phabricator.wikimedia.org/T416384) [09:09:52] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:09:59] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [09:11:15] (03CR) 10AikoChou: [C:03+2] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [09:12:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037', diff saved to https://phabricator.wikimedia.org/P91336 and previous config saved to /var/cache/conftool/dbconfig/20260423-091224-fceratto.json [09:13:15] (03Merged) 10jenkins-bot: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 
(https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [09:15:39] (03CR) 10Harroyo-wmf: "I've tested this locally by setting `$wgHCaptchaApiUrl` in `LocalSettings.php` like this:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [09:16:32] (03PS2) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) [09:17:13] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:19:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh5003.wikimedia.org with OS bookworm [09:19:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh5003.wikimedia.org [09:19:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850448 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host doh5003.wikimedia.org with OS bookworm completed: - doh5003... 
[09:22:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91337 and previous config saved to /var/cache/conftool/dbconfig/20260423-092232-fceratto.json [09:22:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2047.codfw.wmnet with reason: Maintenance [09:23:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2047 (T419961)', diff saved to https://phabricator.wikimedia.org/P91338 and previous config saved to /var/cache/conftool/dbconfig/20260423-092303-fceratto.json [09:25:19] (03PS1) 10Ilias Sarantopoulos: ml-services: update prod image for outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582) [09:25:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh5004.wikimedia.org [09:25:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:25:54] (03PS4) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) [09:27:41] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[09:29:13] (03CR) 10AikoChou: [C:03+1] ml-services: update prod image for outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [09:30:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047 (T419961)', diff saved to https://phabricator.wikimedia.org/P91339 and previous config saved to /var/cache/conftool/dbconfig/20260423-093010-fceratto.json [09:32:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5004.wikimedia.org - jmm@cumin2002" [09:34:43] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [09:34:49] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the test-k8s instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275855 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [09:35:12] (03CR) 10Kamila Součková: [C:03+1] api rate limits: use global apihighlimits-requestor group. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [09:35:15] (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the analytics-test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275856 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [09:35:16] jmm@cumin2002 makevm (PID 3057772) is awaiting input [09:35:33] (03PS1) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 [09:37:37] (03CR) 10CI reject: [V:04-1] ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi) [09:38:13] (03CR) 10Kamila Součková: [C:03+1] rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 (owner: 10Daniel Kinzler) [09:39:13] (03PS2) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 [09:40:16] (03PS3) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 [09:40:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047', diff saved to https://phabricator.wikimedia.org/P91340 and previous config saved to /var/cache/conftool/dbconfig/20260423-094019-fceratto.json [09:40:51] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update prod image for outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [09:42:27] (03CR) 10Kamila Součková: [C:03+1] redioscope: add more histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276363 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [09:42:53] (03Merged) 10jenkins-bot: ml-services: update prod image for 
outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [09:49:31] (03PS3) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) [09:50:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047', diff saved to https://phabricator.wikimedia.org/P91341 and previous config saved to /var/cache/conftool/dbconfig/20260423-095027-fceratto.json [09:51:39] (03CR) 10Klausman: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [09:52:43] (03CR) 10Kamila Součková: rest-gateway: adjust rate limits (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [09:55:26] (03CR) 10Ayounsi: "Tested on Netbox-next: https://netbox-next.wikimedia.org/extras/scripts/results/304425/" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi) [09:55:37] (03PS5) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) [09:55:42] (03CR) 10Daniel Kinzler: rest-gateway: adjust rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [09:55:59] (03PS6) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [09:56:19] (03PS3) 10Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 [09:56:38] (03PS6) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) [09:57:21] (03PS1) 10Marostegui: db2146: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276637 (https://phabricator.wikimedia.org/T424189) [09:58:27] (03CR) 10Elukey: [C:03+1] ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi) [09:58:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5004.wikimedia.org - jmm@cumin2002" [09:58:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:58:44] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh5004.wikimedia.org on all recursors [09:58:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh5004.wikimedia.org on all recursors [09:59:22] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5004.wikimedia.org - jmm@cumin2002" [09:59:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5004.wikimedia.org - jmm@cumin2002" [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1000) [10:00:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047 (T419961)', diff saved to https://phabricator.wikimedia.org/P91343 and previous config saved to /var/cache/conftool/dbconfig/20260423-100035-fceratto.json [10:00:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh5004.wikimedia.org with OS bookworm [10:01:08] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: adjust rate limits [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [10:01:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host doh5004.wikimedia.org with OS bookworm [10:01:57] (03CR) 10Marostegui: [C:03+2] db2146: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276637 (https://phabricator.wikimedia.org/T424189) (owner: 10Marostegui) [10:07:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [10:07:37] (03PS4) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 [10:08:04] (03Merged) 10jenkins-bot: api rate limits: use global apihighlimits-requestor group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [10:08:44] !log daniel@deploy1003 Started scap sync-world: Backport for [[gerrit:1275410|api rate limits: use global apihighlimits-requestor group. 
(T419796)]] [10:08:48] T419796: API rate limits: define tiers for logged-in (browser) users - https://phabricator.wikimedia.org/T419796 [10:09:49] (03CR) 10Ayounsi: [C:03+2] ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi) [10:10:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye [10:10:17] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2008.codfw.wmnet with OS bullseye [10:10:23] !log daniel@deploy1003 daniel: Backport for [[gerrit:1275410|api rate limits: use global apihighlimits-requestor group. (T419796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:11:01] (03PS1) 10Marostegui: pc2012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276638 (https://phabricator.wikimedia.org/T424201) [10:11:39] (03CR) 10Marostegui: [C:03+2] pc2012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276638 (https://phabricator.wikimedia.org/T424201) (owner: 10Marostegui) [10:11:50] (03CR) 10Daniel Kinzler: [C:03+2] redioscope: add more histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276363 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [10:12:31] !log daniel@deploy1003 daniel: Continuing with deployment [10:12:47] (03Merged) 10jenkins-bot: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi) [10:13:22] (03CR) 10Muehlenhoff: "Ok, let's simply use openstack-trixie-flamingo and openstack-trixie-gazpacho, then" [puppet] - 10https://gerrit.wikimedia.org/r/1276009 
(https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [10:13:42] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:13:49] (03PS1) 10Marostegui: instances.yaml: Remove pc2012, add pc2022 [puppet] - 10https://gerrit.wikimedia.org/r/1276641 (https://phabricator.wikimedia.org/T424201) [10:13:50] (03Merged) 10jenkins-bot: redioscope: add more histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276363 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [10:14:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:14:25] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove pc2012, add pc2022 [puppet] - 10https://gerrit.wikimedia.org/r/1276641 (https://phabricator.wikimedia.org/T424201) (owner: 10Marostegui) [10:14:45] !log daniel@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [10:14:47] (03CR) 10Muehlenhoff: "I meant openstack-trixie-flamingo-backports and openstack-trixie-gazpacho-backports" [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [10:14:47] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:15:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:15:08] !log daniel@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [10:15:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2022, remove pc2012 T418973 T424201', diff saved to https://phabricator.wikimedia.org/P91345 and previous config saved to /var/cache/conftool/dbconfig/20260423-101544-marostegui.json [10:15:50] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [10:15:50] T424201: decommission 
pc2012.codfw.wmnet - https://phabricator.wikimedia.org/T424201 [10:16:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Make pc2022 master of pc2 T418973', diff saved to https://phabricator.wikimedia.org/P91346 and previous config saved to /var/cache/conftool/dbconfig/20260423-101611-marostegui.json [10:16:21] !log daniel@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275410|api rate limits: use global apihighlimits-requestor group. (T419796)]] (duration: 07m 37s) [10:16:25] T419796: API rate limits: define tiers for logged-in (browser) users - https://phabricator.wikimedia.org/T419796 [10:16:28] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [10:17:06] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [10:17:18] (03PS1) 10Kevin Bazira: ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) [10:17:19] (03PS1) 10Marostegui: pc2022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276642 (https://phabricator.wikimedia.org/T418973) [10:17:38] (03Abandoned) 10Urbanecm: GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [10:18:24] (03CR) 10Marostegui: [C:03+2] pc2022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276642 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [10:18:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1037.eqiad.wmnet with reason: Maintenance [10:19:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91347 and previous config saved to 
/var/cache/conftool/dbconfig/20260423-101855-fceratto.json [10:19:28] !log daniel@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: apply [10:19:37] !log daniel@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: apply [10:19:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc2 with pc2022 as codfw master T418973', diff saved to https://phabricator.wikimedia.org/P91348 and previous config saved to /var/cache/conftool/dbconfig/20260423-101957-marostegui.json [10:20:13] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:20:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "deploying under the assumption that this is an uncontroversial simple fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276402 (https://phabricator.wikimedia.org/T414376) (owner: 10Lucas Werkmeister (WMDE)) [10:20:55] * Lucas_WMDE will deploy ^ in a moment [10:21:12] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[10:21:35] (03PS3) 10Muehlenhoff: ganeti: Move pki::get_cert into the profile [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T424204) [10:22:55] (03PS1) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) [10:22:57] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276402 (https://phabricator.wikimedia.org/T414376) (owner: 10Lucas Werkmeister (WMDE)) [10:23:31] (03CR) 10CI reject: [V:04-1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [10:23:43] (03PS1) 10Marostegui: pc2022: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1276647 [10:23:58] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [10:24:14] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [10:24:19] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [10:24:26] (03CR) 10Kamila Součková: [C:03+1] rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [10:24:27] (03CR) 10Marostegui: [C:03+2] pc2022: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1276647 (owner: 10Marostegui) [10:24:35] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [10:24:39] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [10:24:52] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply 
[10:25:13] (03PS3) 10Daniel Kinzler: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 [10:25:57] jouncebot: nowandnext [10:25:57] For the next 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1000) [10:25:57] In 1 hour(s) and 34 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200) [10:26:39] I will roll out a simple restbase change now-ish [10:27:11] !log hnowlan@deploy1003 Started deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change) [10:27:14] T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975 [10:27:20] (03CR) 10Ozge: [C:03+1] ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [10:27:54] (03PS1) 10JavierMonton: alert: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) [10:28:16] * Lucas_WMDE done deploying btw [10:28:27] (03PS2) 10Kevin Bazira: ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) [10:28:42] (03CR) 10Kamila Součková: [C:03+1] rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler) [10:31:02] (03CR) 10Kevin Bazira: [C:03+2] ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [10:32:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage [10:33:16] (03Merged) 10jenkins-bot: ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [10:33:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91351 and previous config saved to /var/cache/conftool/dbconfig/20260423-103334-fceratto.json [10:33:58] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:37:20] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:38:10] (03PS2) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) [10:38:24] FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:38:45] (03CR) 10CI reject: [V:04-1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [10:39:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage [10:42:03] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:42:14] (03PS2) 10JavierMonton: alert: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) [10:43:01] PROBLEM - Restbase root url on restbase2033 is CRITICAL: connect to 
address 10.192.32.174 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [10:43:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037', diff saved to https://phabricator.wikimedia.org/P91352 and previous config saved to /var/cache/conftool/dbconfig/20260423-104343-fceratto.json [10:45:10] (03PS4) 10Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 [10:45:10] (03PS1) 10Daniel Kinzler: rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 (https://phabricator.wikimedia.org/T413448) [10:45:37] (03PS7) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) [10:45:45] (03PS4) 10Daniel Kinzler: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 [10:46:37] (03PS3) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) [10:47:15] (03CR) 10CI reject: [V:04-1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [10:48:02] (03CR) 10Kamila Součková: [C:04-1] "need to figure out what's up with the CI diff" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [10:50:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5004.wikimedia.org with reason: host reimage [10:50:53] (03CR) 10Kamila Součková: [C:03+1] rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 
(https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [10:51:43] (03PS4) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) [10:52:16] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850797 (10MatthewVernon) [10:52:30] (03CR) 10Kamila Součková: [C:03+1] rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler) [10:53:39] (03PS5) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) [10:53:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037', diff saved to https://phabricator.wikimedia.org/P91353 and previous config saved to /var/cache/conftool/dbconfig/20260423-105351-fceratto.json [10:54:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5004.wikimedia.org with reason: host reimage [10:55:56] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [10:57:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye [10:58:01] RECOVERY - Restbase root url on restbase2033 is OK: HTTP OK: HTTP/1.1 200 - 18783 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/RESTBase [10:58:03] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host 
thanos-be2008.codfw.wmnet with OS bullseye completed... [10:59:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [11:00:21] (03PS1) 10Muehlenhoff: Make doh5003/doh5004 wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1276656 (https://phabricator.wikimedia.org/T421863) [11:00:31] !log hnowlan@deploy1003 Finished deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change) (duration: 33m 20s) [11:00:40] T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975 [11:00:45] jouncebot: nownadnext [11:00:50] jouncebot: nownandext [11:00:53] sigh. [11:01:13] last restbase rollout stalled on a single host, going again [11:01:16] !log hnowlan@deploy1003 Started deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change) [11:01:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [11:02:03] (03PS1) 10Muehlenhoff: Add netflow5003 [puppet] - 10https://gerrit.wikimedia.org/r/1276657 (https://phabricator.wikimedia.org/T421863) [11:02:58] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [11:03:05] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler) [11:03:11] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 
(https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [11:03:16] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 (owner: 10Daniel Kinzler) [11:04:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91354 and previous config saved to /var/cache/conftool/dbconfig/20260423-110359-fceratto.json [11:05:12] (03Merged) 10jenkins-bot: rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [11:05:15] (03Merged) 10jenkins-bot: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler) [11:05:20] (03Merged) 10jenkins-bot: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler) [11:05:46] (03Merged) 10jenkins-bot: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 (owner: 10Daniel Kinzler) [11:08:21] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:08:59] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:11:08] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:11:18] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the test-k8s instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275855 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:11:26] (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the analytics-test 
instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275856 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:12:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh5004.wikimedia.org with OS bookworm [11:12:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh5004.wikimedia.org [11:12:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host doh5004.wikimedia.org with OS bookworm completed: - doh5004... [11:13:05] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:13:12] !log hnowlan@deploy1003 Finished deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change) (duration: 11m 55s) [11:13:16] T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975 [11:13:26] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:13:46] (03Merged) 10jenkins-bot: Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:13:51] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the test-k8s instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275855 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:14:17] (03Merged) 10jenkins-bot: Deploy the new Airflow version to the analytics-test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275856 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:16:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [11:16:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [11:19:25] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:20:19] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:21:01] !log installing ngtcp2 security updates [11:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2039.codfw.wmnet with reason: Maintenance [11:21:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2039 (T419961)', diff saved to https://phabricator.wikimedia.org/P91355 and previous config saved to /var/cache/conftool/dbconfig/20260423-112133-fceratto.json [11:29:55] (03CR) 10Ayounsi: [C:03+1] "lgtm but not strictly necessary" [puppet] - 10https://gerrit.wikimedia.org/r/1276657 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [11:31:05] (03CR) 10Kosta Harlan: "> There was, however, an error in this conig: I've put the param as a boolean, but it must be a string. 
I've updated the patch to fix that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [11:31:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275926 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [11:31:51] (03CR) 10Muehlenhoff: [C:03+2] Add netflow5003 [puppet] - 10https://gerrit.wikimedia.org/r/1276657 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [11:33:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039 (T419961)', diff saved to https://phabricator.wikimedia.org/P91356 and previous config saved to /var/cache/conftool/dbconfig/20260423-113307-fceratto.json [11:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:34:40] (03PS1) 10Muehlenhoff: Add library hint for ngtcp2 [puppet] - 10https://gerrit.wikimedia.org/r/1276661 [11:36:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow5003.eqsin.wmnet [11:36:25] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:40:12] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow5003.eqsin.wmnet - jmm@cumin2002" [11:40:26] I'll be doing cxserver deployment. staging only. 
[11:42:00] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for ngtcp2 [puppet] - 10https://gerrit.wikimedia.org/r/1276661 (owner: 10Muehlenhoff) [11:42:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow5003.eqsin.wmnet - jmm@cumin2002" [11:42:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:42:12] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow5003.eqsin.wmnet on all recursors [11:42:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow5003.eqsin.wmnet on all recursors [11:42:49] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow5003.eqsin.wmnet - jmm@cumin2002" [11:42:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow5003.eqsin.wmnet - jmm@cumin2002" [11:43:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039', diff saved to https://phabricator.wikimedia.org/P91357 and previous config saved to /var/cache/conftool/dbconfig/20260423-114316-fceratto.json [11:44:10] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [11:44:55] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[11:44:56] (03CR) 10Filippo Giunchedi: "LGTM, I'm adding o11y folks for heads up and actual votes" [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [11:45:55] jmm@cumin2002 makevm (PID 3146970) is awaiting input [11:47:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow5003.eqsin.wmnet with OS bookworm [11:47:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow5003.eqsin.wmnet with OS bookworm [11:53:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039', diff saved to https://phabricator.wikimedia.org/P91358 and previous config saved to /var/cache/conftool/dbconfig/20260423-115324-fceratto.json [11:54:02] (03PS1) 10KartikMistry: cxserver: staging: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276667 (https://phabricator.wikimedia.org/T423002) [11:57:32] (03CR) 10KartikMistry: [C:03+2] cxserver: staging: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276667 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry) [11:59:18] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:59:27] (03Merged) 10jenkins-bot: cxserver: staging: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276667 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200) [12:00:20] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [12:00:45] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:01:42] (03PS3) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) [12:02:50] (03CR) 10Harroyo-wmf: "A Google search for `hcaptcha "sentry=true"` suggests that this param should be put as a tring in the URL so probably yes, I'll update thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [12:03:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039 (T419961)', diff saved to https://phabricator.wikimedia.org/P91359 and previous config saved to /var/cache/conftool/dbconfig/20260423-120332-fceratto.json [12:03:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance [12:04:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2040 (T419961)', diff saved to https://phabricator.wikimedia.org/P91360 and previous config saved to /var/cache/conftool/dbconfig/20260423-120400-fceratto.json [12:04:10] (03PS4) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) [12:05:10] (03CR) 10Harroyo-wmf: "Patch updated" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [12:05:14] jouncebot: nowandnext [12:05:14] For the next 0 hour(s) and 54 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200) 
[12:05:14] In 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1300) [12:07:54] I’m going to sync some patches ahead of the window, unless there are any objections [12:08:18] (03PS1) 10Kosta Harlan: hCaptcha: Retry SiteVerify up to two times [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276671 (https://phabricator.wikimedia.org/T421204) [12:08:44] !log staging: Update cxserver to 2026-04-23-114216-production (T423002) [12:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:48] T423002: Migrate cxserver in production to node24 - https://phabricator.wikimedia.org/T423002 [12:09:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [12:10:47] (03Merged) 10jenkins-bot: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [12:11:03] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1275429|hCaptcha: Don't prevent opening links present in the hCaptcha popup (T408812)]] [12:11:06] T408812: hCaptcha: Clicking links in Accessibility Cookie dialog does nothing - https://phabricator.wikimedia.org/T408812 [12:12:39] !log kharlan@deploy1003 harroyo-wmf, kharlan: Backport for [[gerrit:1275429|hCaptcha: Don't prevent opening links present in the hCaptcha popup (T408812)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:13:04] (03CR) 10Harroyo-wmf: "for the record: According to hCaptcha Typescript SDK it should indeed be a string:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [12:13:28] (03CR) 10Kosta Harlan: "Thanks. In retrospect, we should have updated the commit message to reflect the 'sentry' change, but that's OK." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [12:14:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040 (T419961)', diff saved to https://phabricator.wikimedia.org/P91361 and previous config saved to /var/cache/conftool/dbconfig/20260423-121439-fceratto.json [12:15:28] !log kharlan@deploy1003 harroyo-wmf, kharlan: Continuing with deployment [12:16:25] (03PS1) 10Ayounsi: eqsin: update netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/1276674 (https://phabricator.wikimedia.org/T421863) [12:16:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye [12:16:51] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2009.codfw.wmnet with OS bullseye [12:16:54] (03CR) 10Ayounsi: "To be deployed once netflow5003 is live" [homer/public] - 10https://gerrit.wikimedia.org/r/1276674 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [12:18:22] (03PS1) 10Kosta Harlan: 
hCaptcha: Disable Private Access Tokens in secure-api URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216) [12:19:14] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275429|hCaptcha: Don't prevent opening links present in the hCaptcha popup (T408812)]] (duration: 08m 11s) [12:19:19] T408812: hCaptcha: Clicking links in Accessibility Cookie dialog does nothing - https://phabricator.wikimedia.org/T408812 [12:20:06] (03PS1) 10Muehlenhoff: rsyslog/toil: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) [12:20:13] (03CR) 10Ayounsi: [C:03+2] remove sandbox1-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1275926 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [12:20:22] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Disable Private Access Tokens in secure-api URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216) (owner: 10Kosta Harlan) [12:21:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216) (owner: 10Kosta Harlan) [12:22:52] (03Merged) 10jenkins-bot: hCaptcha: Disable Private Access Tokens in secure-api URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216) (owner: 10Kosta Harlan) [12:23:08] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1276675|hCaptcha: Disable Private Access Tokens in secure-api URL (T424216)]] [12:23:12] T424216: hCaptcha: Set pat=off in hCaptcha secure-api.js URL settings - https://phabricator.wikimedia.org/T424216 [12:24:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040', diff saved to https://phabricator.wikimedia.org/P91362 and previous 
config saved to /var/cache/conftool/dbconfig/20260423-122448-fceratto.json [12:24:50] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1276675|hCaptcha: Disable Private Access Tokens in secure-api URL (T424216)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:25:40] (03PS1) 10Ilias Sarantopoulos: Add gRPC port to kserve-inference NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276681 (https://phabricator.wikimedia.org/T423582) [12:26:23] !log kharlan@deploy1003 kharlan: Continuing with deployment [12:30:06] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276675|hCaptcha: Disable Private Access Tokens in secure-api URL (T424216)]] (duration: 06m 57s) [12:30:10] T424216: hCaptcha: Set pat=off in hCaptcha secure-api.js URL settings - https://phabricator.wikimedia.org/T424216 [12:30:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [12:30:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276671 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [12:30:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow5003.eqsin.wmnet with reason: host reimage [12:31:49] (03Merged) 10jenkins-bot: hCaptcha: Retry SiteVerify up to two times [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276671 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [12:32:04] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1276671|hCaptcha: Retry SiteVerify up to two times (T421204)]] [12:33:38] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1276671|hCaptcha: Retry SiteVerify up to two times (T421204)]] synced to the testservers 
(see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:34:45] !log kharlan@deploy1003 kharlan: Continuing with deployment [12:34:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040', diff saved to https://phabricator.wikimedia.org/P91363 and previous config saved to /var/cache/conftool/dbconfig/20260423-123456-fceratto.json [12:36:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow5003.eqsin.wmnet with reason: host reimage [12:37:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:38:30] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276671|hCaptcha: Retry SiteVerify up to two times (T421204)]] (duration: 06m 25s) [12:39:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage [12:40:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:44:03] jouncebot: nowandnext [12:44:03] For the next 0 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200) [12:44:03] In 0 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1300) [12:45:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040 (T419961)', diff saved to https://phabricator.wikimedia.org/P91365 and previous config saved to /var/cache/conftool/dbconfig/20260423-124504-fceratto.json [12:45:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: 
host reimage [12:45:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2048.codfw.wmnet with reason: Maintenance [12:45:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2048 (T419961)', diff saved to https://phabricator.wikimedia.org/P91366 and previous config saved to /var/cache/conftool/dbconfig/20260423-124535-fceratto.json [12:48:12] Amir1: I’m done with my deploys [12:52:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048 (T419961)', diff saved to https://phabricator.wikimedia.org/P91367 and previous config saved to /var/cache/conftool/dbconfig/20260423-125247-fceratto.json [12:53:15] (03CR) 10Muehlenhoff: [C:03+2] ganeti: Move pki::get_cert into the profile [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [12:55:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow5003.eqsin.wmnet with OS bookworm [12:55:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow5003.eqsin.wmnet [12:55:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851253 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow5003.eqsin.wmnet with OS bookworm completed: - netflo... [12:59:04] kostajh: thanks, but now I need to go to meetings, will do it afterwards. [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1300). [13:00:04] aude and HouseOfM: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:00:16] hi [13:00:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir5003.eqsin.wmnet [13:00:21] o/ [13:00:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:00:49] aude: go ahead with your change, I think :) [13:01:05] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1015.eqiad.wmnet with reason: Decommissioning — T412830 [13:01:07] is HouseOfM here? [13:01:09] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [13:01:30] (03PS1) 10C. Scott Ananian: Parsoid Read Views: 100% rollout to Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188) [13:01:32] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:01:36] not so far, it looks like [13:01:48] ok then I will deploy mine [13:01:56] I would do that config change separately anyway, feels a bit risky [13:02:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188) (owner: 10C. 
Scott Ananian) [13:02:13] (though according to jhs’ comment it should probably be fine) [13:02:28] o/ [13:02:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:02:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [13:02:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [13:02:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048', diff saved to https://phabricator.wikimedia.org/P91368 and previous config saved to /var/cache/conftool/dbconfig/20260423-130255-fceratto.json [13:03:05] aude: I guess you could deploy cscott’s change together with yours, if you like [13:03:24] yeah should be safe [13:03:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye [13:03:35] ah didn't see [13:03:39] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851265 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2009.codfw.wmnet with OS bullseye completed... 
[13:04:04] it just came in ^^ [13:04:10] aude: no worries, i was late :) [13:04:41] i can batch them [13:04:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188) (owner: 10C. Scott Ananian) [13:04:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:05:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:05:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:05:47] (03PS1) 10Xcollazo: Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694) [13:06:08] jmm@cumin2002 makevm (PID 3208625) is awaiting input [13:06:59] (03Merged) 10jenkins-bot: Parsoid Read Views: 100% rollout to Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188) (owner: 10C. Scott Ananian) [13:07:02] (03CR) 10CI reject: [V:04-1] Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:07:32] checking what is wrong [13:08:12] aude: looks like T419488 to me :/ [13:08:12] T419488: PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488 [13:08:18] safe to retry IMHO [13:08:33] ok [13:08:39] yeah seems unrelated to my change [13:08:50] castor-save-workspace-cache failed, yeah, it's been doing that. 
[13:08:50] there [13:09:00] there's a retry button on spiderpig that you can just click and it should work [13:09:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:10:41] (03Merged) 10jenkins-bot: Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:11:35] they seem merged [13:11:48] but spiderpig says error [13:12:16] do i retry to have it continue? [13:12:52] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir5003.eqsin.wmnet - jmm@cumin2002" [13:13:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048', diff saved to https://phabricator.wikimedia.org/P91369 and previous config saved to /var/cache/conftool/dbconfig/20260423-131303-fceratto.json [13:13:48] (03PS1) 10Marostegui: db2252: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1276701 [13:14:47] aude: yes [13:14:49] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11851291 (10Papaul) @ssingh hello just wanted to let you and your team that we have decided to do the switch refresh starting May 4th to May 6th ( 3 days) - Fi... [13:14:51] ok [13:15:20] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1276697|Parsoid Read Views: 100% rollout to Russian Wikipedia (T423188)]], [[gerrit:1276021|Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis (T420881)]] [13:15:23] it should jump back to the "waiting for merge" and should C+2 the stuck patch again. at least in my experience.
[13:15:26] T423188: Parsoid Read Views to deploy ~2026-04-16 - https://phabricator.wikimedia.org/T423188 [13:15:26] T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881 [13:15:57] jmm@cumin2002 makevm (PID 3208625) is awaiting input [13:16:56] !log aude@deploy1003 cscott, aude: Backport for [[gerrit:1276697|Parsoid Read Views: 100% rollout to Russian Wikipedia (T423188)]], [[gerrit:1276021|Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis (T420881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:17:05] please check [13:17:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [13:17:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [13:17:58] mine looks good [13:18:08] yup looks good [13:18:11] thanks [13:18:15] !log aude@deploy1003 cscott, aude: Continuing with deployment [13:21:09] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [13:21:12] (03PS1) 10JavierMonton: alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) [13:21:41] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [13:22:01] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [13:22:02] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276697|Parsoid 
Read Views: 100% rollout to Russian Wikipedia (T423188)]], [[gerrit:1276021|Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis (T420881)]] (duration: 06m 42s) [13:22:09] T423188: Parsoid Read Views to deploy ~2026-04-16 - https://phabricator.wikimedia.org/T423188 [13:22:09] T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881 [13:22:17] all done [13:22:29] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [13:22:33] thanks for deploying aude! [13:22:38] aude: thanks! [13:22:39] np [13:22:55] I’ll try pinging HouseOfM on slack [13:23:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048 (T419961)', diff saved to https://phabricator.wikimedia.org/P91370 and previous config saved to /var/cache/conftool/dbconfig/20260423-132311-fceratto.json [13:24:45] o/ [13:25:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir5003.eqsin.wmnet - jmm@cumin2002" [13:25:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:25:04] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir5003.eqsin.wmnet on all recursors [13:25:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir5003.eqsin.wmnet on all recursors [13:25:23] hi HouseOfM! [13:25:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5003.eqsin.wmnet - jmm@cumin2002" [13:25:51] you need a deployer, right? or do you have spiderpig access? 
[13:25:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5003.eqsin.wmnet - jmm@cumin2002" [13:26:03] I do need a deployer, if someone is available [13:26:13] sure, I can deploy [13:26:27] I would love spiderpig access but alas that isn't available to me right now [13:26:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey) [13:26:54] (03CR) 10Marostegui: [C:03+2] db2252: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1276701 (owner: 10Marostegui) [13:27:05] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts tcp-proxy5001.eqsin.wmnet [13:27:40] (03PS3) 10Klausman: manifests/hiera: Move ml-serve101[45] to k8s worker role [puppet] - 10https://gerrit.wikimedia.org/r/1275814 [13:28:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851342 (10MoritzMuehlenhoff) [13:28:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5003.eqsin.wmnet with OS bookworm [13:28:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851344 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir5003.eqsin.wmnet with OS bookworm [13:30:14] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey) [13:30:31] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266964|Enable the CampaignEvents extension on incubator (T421749)]] 
[13:30:35] T421749: Deploy CampaignEvents to Wikimedia Incubator - https://phabricator.wikimedia.org/T421749 [13:30:54] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8455/co" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [13:32:05] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:32:09] !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Backport for [[gerrit:1266964|Enable the CampaignEvents extension on incubator (T421749)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:32:34] LGTM [13:32:56] !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Continuing with deployment [13:33:08] oooh, spiderpig looks different [13:33:16] the “no” option is now “roll back deployment and terminate” [13:33:34] that’s probably T225207 :) [13:33:34] T225207: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 [13:33:53] noice [13:36:42] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266964|Enable the CampaignEvents extension on incubator (T421749)]] (duration: 06m 11s) [13:36:46] T421749: Deploy CampaignEvents to Wikimedia Incubator - https://phabricator.wikimedia.org/T421749 [13:37:48] jmm@cumin2002 decommission (PID 3224790) is awaiting input [13:38:53] TYSM Lucas_WMDE [13:39:01] np :) [13:39:08] !log UTC afternoon backport+config window done [13:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:53] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11851429 (10jhathaway) >>! In T423282#11844960, @MoritzMuehlenhoff wrote: > Poking at this further I also noticed one other discrepancy actually: For some reason... 
[13:50:12] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11851432 (10ABran-WMF) yes it should be safe to reboot, you can proceed. Feel free to reach out, HTH [13:52:06] (03PS1) 10Marostegui: site.pp: Remove db2145 [puppet] - 10https://gerrit.wikimedia.org/r/1276707 (https://phabricator.wikimedia.org/T424177) [13:52:54] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2145.codfw.wmnet [13:53:02] (03CR) 10Marostegui: [C:03+2] site.pp: Remove db2145 [puppet] - 10https://gerrit.wikimedia.org/r/1276707 (https://phabricator.wikimedia.org/T424177) (owner: 10Marostegui) [13:56:45] (03CR) 10SBassett: [C:03+1] "From a security standpoint, in that this how we want to configure the beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/1276017 (https://phabricator.wikimedia.org/T420604) (owner: 10Ssingh) [13:59:09] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:59:32] (03PS1) 10Muehlenhoff: Remove tcp-proxy5001/5002 from conftool [puppet] - 10https://gerrit.wikimedia.org/r/1276709 (https://phabricator.wikimedia.org/T421863) [13:59:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:59:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:59:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts tcp-proxy5001.eqsin.wmnet [13:59:39] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [13:59:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - 
https://phabricator.wikimedia.org/T421863#11851493 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `tcp-proxy5001.eqsin.wmnet` - tcp-proxy5001.eqsin.wmnet (**PA... [14:00:27] (03CR) 10Ottomata: [C:03+1] Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694) (owner: 10Xcollazo) [14:00:41] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts tcp-proxy5002.eqsin.wmnet [14:03:50] jmm@cumin2002 decommission (PID 3247408) is awaiting input [14:05:17] marostegui@cumin1003 decommission (PID 275362) is awaiting input [14:06:21] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2145.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [14:06:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2145.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [14:06:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:06:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2145.codfw.wmnet [14:07:15] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177#11851506 (10Marostegui) a:05Marostegui→03Jhancock.wm [14:07:20] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177#11851511 (10Marostegui) Ready for #dc-ops [14:10:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:10:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5003.eqsin.wmnet with reason: host 
reimage [14:13:11] (03CR) 10Elukey: [C:03+1] "LGTM! Remember two things:" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [14:15:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5003.eqsin.wmnet with reason: host reimage [14:16:13] jmm@cumin2002 decommission (PID 3247408) is awaiting input [14:20:02] (03CR) 10Klausman: [V:03+1] "Ack! For #3, I'd like to shoulder-surf you deploying, just to see how it's done." [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [14:21:41] (03PS1) 10Jelto: miscweb: remove config.private in wmf-navigator release values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276711 (https://phabricator.wikimedia.org/T414405) [14:22:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:22:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:22:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:22:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts tcp-proxy5002.eqsin.wmnet [14:22:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851546 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `tcp-proxy5002.eqsin.wmnet` - tcp-proxy5002.eqsin.wmnet (**PA... 
[14:23:54] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851554 (10MatthewVernon) [14:24:58] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851558 (10MatthewVernon) 05In progress→03Resolved All done, the most-filled `/` is now 26% full, which seems healthier. [14:26:05] (03CR) 10Jelto: [C:03+2] miscweb: remove config.private in wmf-navigator release values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276711 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:28:34] (03Merged) 10jenkins-bot: miscweb: remove config.private in wmf-navigator release values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276711 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1430) [14:33:28] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [14:33:45] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [14:33:54] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:34:18] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:35:49] (03CR) 10Mpostoronca: [C:03+2] "I trust Hector qa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [14:36:08] (03CR) 10Brouberol: [C:03+1] growthbook: Bump vendored job templ 1.0.1 → 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) (owner: 10Ryan Kemper) [14:36:23] (03CR) 10Mpostoronca: "Did qa locally, it passed" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [14:38:24] FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5003.eqsin.wmnet with OS bookworm [14:39:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir5003.eqsin.wmnet [14:39:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851625 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir5003.eqsin.wmnet with OS bookworm completed: - ncredi... [14:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:42:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir5004.eqsin.wmnet [14:42:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:45:09] jouncebot: nowandnext [14:45:09] For the next 0 hour(s) and 14 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1430) [14:45:09] In 0 hour(s) and 14 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1500) [14:46:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM ncredir5004.eqsin.wmnet - jmm@cumin2002" [14:46:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir5004.eqsin.wmnet - jmm@cumin2002" [14:46:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:46:20] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir5004.eqsin.wmnet on all recursors [14:46:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir5004.eqsin.wmnet on all recursors [14:46:57] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5004.eqsin.wmnet - jmm@cumin2002" [14:47:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5004.eqsin.wmnet - jmm@cumin2002" [14:48:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5004.eqsin.wmnet with OS bookworm [14:48:58] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851646 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir5004.eqsin.wmnet with OS bookworm [14:50:52] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11851651 (10ssingh) >>! In T408892#11851291, @Papaul wrote: > @ssingh hello just wanted to let you and your team that we have decided to do the switch refresh... 
[14:52:43] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11851668 (10Dzahn) Thank you all involved in getting this installed. Handing over to @Arnoldokoth [14:54:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T410589)', diff saved to https://phabricator.wikimedia.org/P91373 and previous config saved to /var/cache/conftool/dbconfig/20260423-145425-ladsgroup.json [14:54:30] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:55:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:55:22] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:56:19] (03CR) 10Ssingh: [V:03+1 C:03+2] varnish: do not set CSP policy for beta [puppet] - 10https://gerrit.wikimedia.org/r/1276017 (https://phabricator.wikimedia.org/T420604) (owner: 10Ssingh) [14:57:50] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T424175 [15:00:04] Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1500) [15:01:12] jouncebot: next [15:01:12] In 0 hour(s) and 58 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1600) [15:03:30] !log installing rsync security updates [15:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:34] !log 
ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91374 and previous config saved to /var/cache/conftool/dbconfig/20260423-150433-ladsgroup.json [15:06:42] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T424175 [15:07:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11851759 (10Jgreen) >>! In T418928#11841638, @Jclark-ctr wrote: > @Jgreen I have not received any updates on mgmt usernames, but I have a feeling we will not be able to use “roo... [15:07:36] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T424175 [15:11:46] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276719 (https://phabricator.wikimedia.org/T418979) [15:12:15] (03PS4) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) [15:12:27] (03PS5) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) [15:12:59] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276719 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [15:13:25] (03PS1) 10Marostegui: installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 [15:14:28] (03CR) 10Marostegui: "Jaime, feel free to merge whenever you want." [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui) [15:14:32] (03CR) 10Jcrespo: "Good catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui) [15:14:38] (03CR) 10Jcrespo: [C:03+1] installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui) [15:14:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91375 and previous config saved to /var/cache/conftool/dbconfig/20260423-151441-ladsgroup.json [15:14:46] (03CR) 10Jcrespo: [C:03+1] "Minor spelling of Bug: header" [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui) [15:15:11] (03PS2) 10Marostegui: installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (https://phabricator.wikimedia.org/T418979) [15:15:14] (03PS3) 10Jcrespo: installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [15:16:41] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T424175 [15:21:17] (03CR) 10AKhatun: [C:03+1] EventStreamConfig - add rc0 streams for html and feature count change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [15:24:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T410589)', diff saved to https://phabricator.wikimedia.org/P91377 and previous config saved to /var/cache/conftool/dbconfig/20260423-152450-ladsgroup.json [15:24:55] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:25:00] (03CR) 10Marostegui: [C:03+2] installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [15:25:07] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
3 days, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:25:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T410589)', diff saved to https://phabricator.wikimedia.org/P91378 and previous config saved to /var/cache/conftool/dbconfig/20260423-152514-ladsgroup.json [15:25:43] (03CR) 10Jcrespo: [C:03+2] mariadb: Reenable notifications for db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276719 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [15:27:56] (03CR) 10Jasmine: [C:03+2] service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [15:30:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5004.eqsin.wmnet with reason: host reimage [15:32:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:34:23] jasmine_: you will need to revert that patch please [15:34:25] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/59a4a67a9072541cebd4c36cca1a92125b340da1%5E%21/#F0 [15:34:27] jasmine_: your change breaks Puppet, see e.g.
https://puppetboard.wikimedia.org/report/cirrussearch1110.eqiad.wmnet/7d5df5a5bc002b9dcfacb9301d96a1a68dc576f6 [15:34:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5004.eqsin.wmnet with reason: host reimage [15:34:53] and we do that rollout in steps so once we get to it on Monday, we will do it in that procedure [15:35:24] (03PS1) 10Jasmine: Revert "service::catalog: add sophroid service catalog entry" [puppet] - 10https://gerrit.wikimedia.org/r/1276723 [15:35:43] (03CR) 10Klausman: [C:03+2] home/klausman: fix c&p error on tmuxp config [puppet] - 10https://gerrit.wikimedia.org/r/1272658 (owner: 10Klausman) [15:36:21] (03CR) 10Jasmine: [C:03+2] Revert "service::catalog: add sophroid service catalog entry" [puppet] - 10https://gerrit.wikimedia.org/r/1276723 (owner: 10Jasmine) [15:37:45] FIRING: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:38:04] Revert in progress, apologies about that [15:38:18] reverted [15:38:57] no worries! [15:41:15] (03PS1) 10Elukey: admin_ng: simplify the deployment of kserve crd resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276726 [15:48:11] (03CR) 10Kamila Součková: "Fixed in I5450ae054cf3b555b228fec72383e58ebc853d5b. Many thanks to @ltoscano@wikimedia.org <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [15:48:35] !log sudo cumin -b31 "A:cp and not P{cp2041* or cp2042*}" "run-puppet-agent --enable 'merging CR 1276017'" T420604. 
finish rollout of removing CSP in VCL from beta [15:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:40] T420604: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604 [15:48:42] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [15:52:01] 06SRE, 10SRE-Access-Requests: Add Papaul FIDO backup SSH key - https://phabricator.wikimedia.org/T423293#11851990 (10jasmine_) 05Open→03Resolved Resolving, thanks! [15:54:09] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11851993 (10ssingh) Discussed with @Papaul a bit -- we will depool the site for all three days, just to be on the safe side and since it's ulsfo, one extra day... [15:54:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5004.eqsin.wmnet with OS bookworm [15:54:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir5004.eqsin.wmnet [15:54:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir5004.eqsin.wmnet with OS bookworm completed: - ncredi... [15:55:59] widespread puppet failure in codfw resolving, thanks jasmine_! [16:00:04] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:24] (03PS1) 10Krinkle: ext.wikiEditor: Set background-size for toolbar buttons [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805) [16:00:31] thanks sukhe! appreciate the quick call too moritzm [16:01:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805) (owner: 10Krinkle) [16:01:23] (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-04-23-122614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276728 [16:02:04] (03CR) 10Klausman: [C:03+1] admin_ng: simplify the deployment of kserve crd resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276726 (owner: 10Elukey) [16:02:35] (03CR) 10Elukey: [C:03+1] "Really nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [16:03:18] (03CR) 10Elukey: [C:03+1] rsyslog/toil: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [16:03:28] (03CR) 10Elukey: [C:03+2] admin_ng: simplify the deployment of kserve crd resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276726 (owner: 10Elukey) [16:06:13] jouncebot: nowandnext [16:06:13] For the next 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1600) [16:06:13] In 0 hour(s) and 53 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700) [16:06:13] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700) [16:07:33] (03PS1) 10Ladsgroup: Media: Fallback to the largest standard size if an overly large one is requested [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745) [16:07:45] FIRING: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:08:03] (03CR) 10Ladsgroup: [C:03+2] Media: Fallback to the largest standard size if an overly large one is requested [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745) (owner: 10Ladsgroup) [16:10:37] !log herron@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-logging-codfw cluster: 
Change Confluent distribution. [16:11:26] (03CR) 10Herron: [V:03+1 C:03+2] kafka-logging: set all codfw brokers to confluent_distribution 77 [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [16:11:42] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:12:05] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [16:12:29] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:12:45] RESOLVED: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:12:46] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:13:21] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:13:57] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[16:14:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745) (owner: 10Ladsgroup) [16:16:33] !log re-enabling general ban on any non-standard thumb (T414805) [16:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:38] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [16:20:19] (03Merged) 10jenkins-bot: Media: Fallback to the largest standard size if an overly large one is requested [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745) (owner: 10Ladsgroup) [16:20:36] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1276731|Media: Fallback to the largest standard size if an overly large one is requested (T418745 T423895)]] [16:20:43] T418745: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745 [16:20:44] T423895: Panorama Template on enwiki uses non-common thumbnail sizes (due to defining image height instead of width) - https://phabricator.wikimedia.org/T423895 [16:22:11] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1276731|Media: Fallback to the largest standard size if an overly large one is requested (T418745 T423895)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:22:41] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [16:23:27] (03PS1) 10Jelto: miscweb: add volumeMounts for wmf-navigator secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276737 (https://phabricator.wikimedia.org/T414405) [16:24:44] RECOVERY - Kafka broker TLS certificate validity on kafka-logging2005 is OK: SSL OK - Certificate kafka-logging2005.codfw.wmnet valid until 2027-03-25 13:20:00 +0000 (expires in 335 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [16:24:57] (03PS2) 10Andrew Bogott: Add upstream repos for openstack flamingo and gazpacho [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) [16:24:57] (03PS2) 10Andrew Bogott: Remove openstack::[client|server]packages::flamingo::bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1276010 [16:24:57] (03PS4) 10Andrew Bogott: Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) [16:25:26] (03CR) 10Herron: [C:03+1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [16:25:50] (03CR) 10Herron: [C:03+1] rsyslog/toil: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff) [16:26:03] (03CR) 10Andrew Bogott: Add upstream repos for openstack flamingo and gazpacho (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [16:26:29] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276731|Media: Fallback to the largest standard size if an overly large one is requested (T418745 T423895)]] (duration: 05m 53s) [16:26:39] T418745: MediaViewer (and the 
commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745 [16:26:40] T423895: Panorama Template on enwiki uses non-common thumbnail sizes (due to defining image height instead of width) - https://phabricator.wikimedia.org/T423895 [16:28:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11852179 (10bking) @ayounsi or #infrastructure-foundations , are you able to assist @Jclark-ctr with getting the device data restored to N... [16:29:29] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Change Confluent distribution for Kafka A:kafka-logging-codfw cluster: Change Confluent distribution. [16:30:36] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11852181 (10herron) [16:31:06] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11852184 (10herron) Cookbook worked well! `END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Ch... [16:39:11] !log jasmine@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: Downtiming to avoid page in case of race condition [16:42:39] (03PS1) 10Jasmine: Revert "wmnet: remove wikikube-ctrl2005 from SRV records" [dns] - 10https://gerrit.wikimedia.org/r/1276747 [16:43:21] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11852249 (10MMiller_WMF) I am Justin's manager and I approve this. 
[16:43:43] (03CR) 10Jasmine: [C:03+2] Revert "wmnet: remove wikikube-ctrl2005 from SRV records" [dns] - 10https://gerrit.wikimedia.org/r/1276747 (owner: 10Jasmine) [16:44:29] !log jasmine@dns1004 START - running authdns-update [16:46:02] !log jasmine@dns1004 END - running authdns-update [16:52:45] (03CR) 10Ssingh: [C:03+1] Make doh5003/doh5004 wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1276656 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [16:54:33] FIRING: [58x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:54:36] FIRING: [8x] CertAlmostExpired: Certificate for service doc1004.eqiad.wmnet:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:54:38] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:54:54] FIRING: CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:56:49] got a page [16:56:54] !incidents [16:56:54] 7861 (ACKED) CertAlmostExpired sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad) [16:57:17] Amir1: here [16:57:21] "going to expire in 9d 20h 57m 35s" ? [16:57:27] it's already ACK'ed [16:57:30] a previous page that was ACKed? 
[16:57:37] I just acked it [16:57:45] I go ping sre-collab [16:57:51] sounds good, thanks Amir1 [16:58:04] oh one thing [16:58:10] > FIRING: [58x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:58:22] 58 hosts having their certs expiring at the same time? [16:58:29] that is fishy [16:58:33] neat [16:58:47] is that the discovery intermediate? [16:59:13] I think that's the discovery certificate [16:59:33] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:59:37] FIRING: [14x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:59:38] FIRING: [34x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:00:04] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700). 
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700) [17:00:36] yeah [17:02:19] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2026-04-23-122614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276728 (owner: 10BryanDavis) [17:04:27] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2026-04-23-122614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276728 (owner: 10BryanDavis) [17:04:38] FIRING: [34x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:06:26] moritzm: elukey: Sorry to ping but we got a page for discovery certs expiring https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired should I just create a ticket for that? [17:07:30] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:07:45] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:08:09] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:08:27] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:09:03] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:09:18] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:09:38] FIRING: [34x] CertAlmostExpired: Certificate for service wdqs1021:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:14:38] FIRING: [35x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:15:17] ah, it is the discovery ones, I was confused thinking it was let's encrypt [17:16:57] does anyone know if the new intermediates are ready for us? [17:17:31] or "use"?..I guess either works ;) . I was looking at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275960 but it's not merged yet [17:39:44] Amir1: there is a ticket for rotating the cert, unless you meant a specific one for silencing the alerts [17:42:07] ah thanks [17:43:31] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:44:38] FIRING: CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:38] FIRING: [3x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:54:38] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:13:38] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Syncing netbox hieradata to fetch BGP for new control planes - jasmine@cumin2002 - T390861" [18:13:43] T390861: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861 [18:16:43] jasmine@cumin2002 sync-netbox-hiera (PID 3414765) is awaiting input [18:19:14] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Syncing netbox hieradata to fetch BGP for new control planes - 
jasmine@cumin2002 - T390861" [18:19:18] T390861: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861 [18:24:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:31:56] (03PS2) 10Herron: kafka-logging: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1276745 (https://phabricator.wikimedia.org/T423723) [18:38:24] FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:49:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:59:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:03:40] Hi, just FYI i'm going to do a stream config deployment... [19:03:43] seems clear! 
[19:04:37] FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:04:54] FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694) (owner: 10Xcollazo) [19:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [19:06:05] (03Merged) 10jenkins-bot: Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694) (owner: 10Xcollazo) [19:06:16] (03Merged) 10jenkins-bot: EventStreamConfig - add rc0 streams for html and feature count change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [19:06:32] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]] [19:06:43] T417694: Perform a one-time clean up of retained data sets in event_sanitize - https://phabricator.wikimedia.org/T417694 [19:06:44] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [19:06:52] !log "ran homer on lsw1-c7-codfw 
and lsw1-b2-codfw following new control planes (T390861)" [19:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:58] T390861: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861 [19:09:37] !log jasmine@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl[2004-2005].codfw.wmnet [19:09:39] !log jasmine@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl[2004-2005].codfw.wmnet [19:14:07] !log otto@deploy1003 xcollazo, otto: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:14:14] T417694: Perform a one-time clean up of retained data sets in event_sanitize - https://phabricator.wikimedia.org/T417694 [19:14:15] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [19:14:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:18:07] Hi, spiderpig / scap seems to be failing. I am getting: [19:18:07] Error: Failed to get release next in namespace mw-debug: exit status 1: Error: Kubernetes cluster unreachable: Get "https://kubemaster.svc.eqiad.wmnet:6443/version": dial tcp 10.2.2.8:6443: connect: connection refused [19:18:24] I'll ask in slack too... [19:20:25] ottomata: curious, have you retried? 
that endpoint is Working For Me [19:21:33] (03CR) 10Ottomata: [C:03+1] alert: mw-page-html-content-change-enrich (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [19:21:55] I did retry yeah [19:21:57] i can try again? [19:22:06] https://spiderpig.wikimedia.org/jobs/1820 [19:23:01] oh wait! the retry succeeded! [19:23:18] the logs keep reprinting so I thought it was new output about the failure [19:23:19] yeah I was gonna say :) not a spiderpig expert but that looks like it went through [19:24:08] so I think you're ready to check on mw-debug and then keep rolling when you're ready [19:24:50] !log otto@deploy1003 xcollazo, otto: Continuing with deployment [19:25:01] yup, thank you, sorry for the noise [19:25:11] all good! sorry for the hiccup [19:28:37] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]] (duration: 22m 05s) [19:28:42] T417694: Perform a one-time clean up of retained data sets in event_sanitize - https://phabricator.wikimedia.org/T417694 [19:28:43] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [19:32:01] (03CR) 10Ottomata: [C:03+1] alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [19:32:51] (03CR) 10Brouberol: [C:03+1] "Confirmed out of band that kafka3.7 had been deployed to whole cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1276745 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [19:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:34:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:44:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:52:49] (03PS1) 10TChin: [eventstreams] Bump to v0.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276779 (https://phabricator.wikimedia.org/T420257) [19:59:38] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T2000) [20:00:05] Krinkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:08:06] ebernhardson: heyo, it looks like you have an mwscript-k8s Metastore.php run from Monday that never got started -- it's wedged in a bad state so I'm just going to delete it, but wanted to check first, do you still need any information off it before I do that? 
[20:09:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:09:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:14:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:24:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11852980 (10VRiley-WMF) a:03VRiley-WMF [20:31:13] is backport window still rolling? i have one more config patch i'd like to squeeze in [20:34:58] (03PS1) 10C. 
Scott Ananian: Deploy Parsoid Read Views to banwiki/ganwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785) [20:35:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785) (owner: 10C. Scott Ananian) [20:35:40] Krinkle: did you deploy your patch? [20:43:17] ok, i'm going to jump in and deploy my config change [20:44:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785) (owner: 10C. Scott Ananian) [20:45:06] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to banwiki/ganwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785) (owner: 10C. Scott Ananian) [20:45:22] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1276786|Deploy Parsoid Read Views to banwiki/ganwiki (T423785)]] [20:45:26] T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785 [20:47:01] !log cscott@deploy1003 cscott: Backport for [[gerrit:1276786|Deploy Parsoid Read Views to banwiki/ganwiki (T423785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:47:22] cscott: I did not, sorry. That's fine yeah. [20:47:39] !log cscott@deploy1003 cscott: Continuing with deployment [20:48:55] I'll let mine ride the 10min of CI meanwhile. 
[20:48:58] (03CR) 10Krinkle: [C:03+2] ext.wikiEditor: Set background-size for toolbar buttons [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805) (owner: 10Krinkle) [20:51:25] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276786|Deploy Parsoid Read Views to banwiki/ganwiki (T423785)]] (duration: 06m 02s) [20:51:31] T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785 [20:54:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:55:07] Krinkle: over to you [20:56:21] (03CR) 10JHathaway: [C:03+1] sre.hosts.provision: make UncoreFrequency dynamic for iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1275889 (https://phabricator.wikimedia.org/T418899) (owner: 10Elukey) [20:56:36] (03CR) 10JHathaway: [C:03+1] Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1273792 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [20:59:25] thx [20:59:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T2100) [21:00:05] (03Merged) 10jenkins-bot: ext.wikiEditor: Set background-size for toolbar buttons [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805) (owner: 10Krinkle) [21:00:49] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] [21:00:52] 
T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [21:01:18] (03CR) 10JHathaway: [C:03+1] Remove puppetmaster::gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/1273790 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [21:02:26] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:03:27] !log krinkle@deploy1003 krinkle: Rolling back deployment [21:03:49] what? [21:03:54] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] (duration: 03m 05s) [21:04:07] Oh, default [n], no mention of "y" [21:04:11] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] [21:04:12] I just pressed enter [21:04:21] I see it now, a few lines up [21:04:31] Wee, that's new :) I'll try to remember that next time [21:05:49] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:05:56] (03CR) 10JHathaway: [C:03+2] nf_conntrack_buckets: use default value [puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) (owner: 10JHathaway) [21:05:58] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [21:06:12] !log krinkle@deploy1003 krinkle: Continuing with deployment [21:09:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:09:58] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] (duration: 05m 47s) [21:10:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:11:10] (03PS1) 10AKhatun: topic: mw-page-html-feature-counts-change-enrich and -next [puppet] - 10https://gerrit.wikimedia.org/r/1276794 (https://phabricator.wikimedia.org/T424223) [21:15:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:20:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:23:44] jhathaway: I don't immediately see why your patch would break puppet, but the timing lines up, do you see anything? 
^ [21:24:06] rzl: thanks, let me look [21:25:45] ah yeah I just looked at the wrong couple of hosts with unrelated failures -- now I do see a lot of "Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/base/firewall/nf_conntrack.conf" [21:26:07] definitely me [21:26:08] lmk if you need anything [21:28:03] hmm, tried on bast4006 and it didn't throw an error on a manual run, hmm, strange [21:29:34] maybe a missing-dependency thing where it succeeds on the second run? [21:30:01] yeah, i'm going to try a second run on the failed hosts... [21:32:20] (03PS1) 10Bking: cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860) [21:32:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [21:34:58] (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [21:35:14] (03CR) 10Bking: [C:03+2] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [21:36:46] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS trixie [21:39:39] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards [21:39:39] ber_of_pending_tasks: 14, 
number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10146, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:39:39] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards [21:39:39] ber_of_pending_tasks: 14, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10148, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:39:39] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards [21:39:39] ber_of_pending_tasks: 14, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10198, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:39:39] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards [21:39:40] 
ber_of_pending_tasks: 9, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10234, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:39:41] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shar [21:39:41] delayed_unassigned_shards: 0, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11333, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:31] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1364, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 261, delayed_unassigned_shards: [21:40:31] er_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 82, active_shards_percent_as_number: 83.57843137254902 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:31] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1364, relocating_shards: 0, initializing_shards: 7, 
unassigned_shards: 261, delayed_unassigned_shards: [21:40:31] er_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 82, active_shards_percent_as_number: 83.57843137254902 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:31] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 257 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1394, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 249, delayed_unassigned_shard [21:40:31] mber_of_pending_tasks: 6, number_of_in_flight_fetch: 5, task_max_waiting_in_queue_millis: 37047, active_shards_percent_as_number: 84.43367655966081 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:39] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 263 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1369, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 257, delayed_unassigned_shards: [21:40:39] er_of_pending_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 330, active_shards_percent_as_number: 83.88480392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:39] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 263 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1369, relocating_shards: 0, 
initializing_shards: 7, unassigned_shards: 256, delayed_unassigned_shards: [21:40:39] er_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 155, active_shards_percent_as_number: 83.88480392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:41] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 263 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1369, relocating_shards: 0, initializing_shards: 8, unassigned_shard [21:40:41] delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 31, active_shards_percent_as_number: 83.88480392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:41:30] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1407, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 217, delayed_unassigned_shards: 0, number_of_pending_ta [21:41:30] number_of_in_flight_fetch: 5, task_max_waiting_in_queue_millis: 36731, active_shards_percent_as_number: 86.21323529411765 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:41:30] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is 
OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1407, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 217, delayed_unassigned_shards: 0, number_of_pending_ta [21:41:30] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 36765, active_shards_percent_as_number: 86.21323529411765 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:41:30] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1451, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 193, delayed_unassigned_shards: 0, number_of_pendin [21:41:31] 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 42834, active_shards_percent_as_number: 87.88612961841308 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:41:40] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1412, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_ta [21:41:40] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 45718, active_shards_percent_as_number: 86.51960784313727 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:41:40] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: 
cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1412, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_ta [21:41:40] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 45751, active_shards_percent_as_number: 86.51960784313727 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:41:40] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1413, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 212, delayed_unassign [21:41:40] s: 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 46823, active_shards_percent_as_number: 86.58088235294117 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:43:45] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:43:52] rzl: it does clear on the second run [21:43:58] aha [21:44:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:45:40] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 
0, initializing_shards: 12, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_t [21:45:40] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:40] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_t [21:45:40] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:40] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_t [21:45:40] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:40] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, 
delayed_unassigned_shards: 0, number_of_pending_t [21:45:41] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:41] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassig [21:45:42] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:47:30] (03PS1) 10AKhatun: stream: mediawiki.page_html_feature_counts_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) [21:48:17] (03PS1) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [21:48:18] (03PS1) 10Aaron Schulz: Add wikibase.v1 module to the sandbox were it is present [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276814 (https://phabricator.wikimedia.org/T422403) [21:48:46] (03CR) 10CI reject: [V:04-1] beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: 10BryanDavis) [21:48:54] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage [21:49:13] (03CR) 10CI reject: [V:04-1] stream: mediawiki.page_html_feature_counts_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276812 
(https://phabricator.wikimedia.org/T424223) (owner: AKhatun)
[21:52:25] (PS2) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168)
[21:54:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage
[21:58:27] (PS2) AKhatun: stream: mediawiki.page_html_feature_counts_change [deployment-charts] - https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223)
[21:59:24] (CR) AKhatun: stream: mediawiki.page_html_feature_counts_change (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: AKhatun)
[22:03:57] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[22:04:32] (PS1) Bking: cloudelastic: set role-level hiera for OpenSearch 2/Trixie [puppet] - https://gerrit.wikimedia.org/r/1276818 (https://phabricator.wikimedia.org/T422860)
[22:04:56] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276818 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[22:07:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2013 to codfw - jhancock@cumin2002"
[22:08:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2013 to codfw - jhancock@cumin2002"
[22:08:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:12:04] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host rdb2013
[22:12:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2013
[22:12:18] !log jhancock@cumin2002 START - Cookbook
sre.network.configure-switch-interfaces for host rdb2014
[22:13:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2014
[22:13:48] (PS1) Ladsgroup: QuickView: Fix relying on non-standard sizes [extensions/MediaSearch] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276819 (https://phabricator.wikimedia.org/T424032)
[22:14:09] jouncebot: nowandnext
[22:14:09] No deployments scheduled for the next 7 hour(s) and 45 minute(s)
[22:14:09] In 7 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260424T0600)
[22:14:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:14:16] noice noice
[22:14:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:14:38] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:14:38] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:14:40] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:14:40] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:15:38] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL:
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:15:40] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:31] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/MediaSearch] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276819 (https://phabricator.wikimedia.org/T424032) (owner: Ladsgroup)
[22:18:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:21:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:21:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1011.eqiad.wmnet with OS trixie
[22:21:41] (PS1) Andrew Bogott: setup_capi.sh.erb: update to resemble upstream guides for magnum-capi [puppet] - https://gerrit.wikimedia.org/r/1276820
[22:21:42] (PS1) Andrew Bogott: Magnum: switch codfw1dev from capi-helm to magnum-cluster-api driver [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782)
[22:22:30] (CR) CI reject: [V:-1] Magnum: switch codfw1dev from capi-helm to magnum-cluster-api driver [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[22:24:20] (Merged) jenkins-bot: QuickView: Fix relying on non-standard sizes [extensions/MediaSearch] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276819 (https://phabricator.wikimedia.org/T424032) (owner: Ladsgroup)
[22:24:37] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1276819|QuickView: Fix relying on
non-standard sizes (T424032)]]
[22:24:41] T424032: MediaSearch results does not use the standard thumbnail sizes - https://phabricator.wikimedia.org/T424032
[22:26:10] (PS2) Andrew Bogott: Magnum: switch codfw1dev from capi-helm to magnum-cluster-api driver [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782)
[22:26:14] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1276819|QuickView: Fix relying on non-standard sizes (T424032)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:26:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:27:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:27:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:27:49] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[22:28:07] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:28:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:28:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:31:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic -
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[22:31:55] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276819|QuickView: Fix relying on non-standard sizes (T424032)]] (duration: 07m 19s)
[22:31:59] T424032: MediaSearch results does not use the standard thumbnail sizes - https://phabricator.wikimedia.org/T424032
[22:34:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:37:58] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:38:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:38:24] FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:38:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[22:45:01] jhancock@cumin2002 provision (PID 3589026) is awaiting input
[22:47:05] (PS6) Cwhite: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: Muehlenhoff)
[22:48:09] FIRING: [14x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:48:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards -
https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[22:49:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:51:37] (PS3) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168)
[22:52:54] (CR) Scott French: [C:+1] "Nice find!" [puppet] - https://gerrit.wikimedia.org/r/1273926 (owner: CDanis)
[22:54:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:58:31] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[23:00:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:00:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:04:52] FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:05:09] (CR) Cwhite: [C:+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1276645/8458/" [puppet] - https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: Muehlenhoff)
[23:05:09] FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard -
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:26:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:28:43] (CR) Scott French: "Thanks, Chris!" [puppet] - https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[23:31:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:31:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[23:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:38:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb2013.codfw.wmnet with OS trixie
[23:38:37] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11853494 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb2013.codfw.wmnet with OS trixie
[23:38:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb2014.codfw.wmnet with OS trixie
[23:38:50] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware:
Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11853495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb2014.codfw.wmnet with OS trixie
[23:40:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1276828
[23:40:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1276828 (owner: TrainBranchBot)
[23:50:28] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1276828 (owner: TrainBranchBot)
[23:54:38] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:55:34] jhancock@cumin2002 reimage (PID 3624979) is awaiting input
[23:59:38] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired