[00:13:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:14:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:09:54] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276453
[01:09:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276453 (owner: TrainBranchBot)
[01:20:06] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1276453 (owner: TrainBranchBot)
[01:52:45] <jinxer-wm>	 FIRING: ProbeDown: Service etherpad1004:9001 has failed probes (http_etherpad_nodejs_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1004:9001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:57:40] <jinxer-wm>	 RESOLVED: ProbeDown: Service etherpad1004:9001 has failed probes (http_etherpad_nodejs_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1004:9001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:00:53] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:06] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 12s)
[02:09:18] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:18] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:24] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:44:48] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T410589)', diff saved to https://phabricator.wikimedia.org/P91317 and previous config saved to /var/cache/conftool/dbconfig/20260423-024447-ladsgroup.json
[02:44:51] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:54:56] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91318 and previous config saved to /var/cache/conftool/dbconfig/20260423-025455-ladsgroup.json
[03:05:04] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91319 and previous config saved to /var/cache/conftool/dbconfig/20260423-030504-ladsgroup.json
[03:15:13] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T410589)', diff saved to https://phabricator.wikimedia.org/P91320 and previous config saved to /var/cache/conftool/dbconfig/20260423-031512-ladsgroup.json
[03:15:17] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[03:15:30] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[03:15:38] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T410589)', diff saved to https://phabricator.wikimedia.org/P91321 and previous config saved to /var/cache/conftool/dbconfig/20260423-031538-ladsgroup.json
[03:34:03] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:14:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:27:21] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize pc2022 [puppet] - https://gerrit.wikimedia.org/r/1276474 (https://phabricator.wikimedia.org/T418973)
[05:27:49] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2012,2022].codfw.wmnet with reason: Cloning
[05:28:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2012: Cloning pc2022 from pc2012
[05:28:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:28:31] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:28:31] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2012: Cloning pc2022 from pc2012
[05:28:48] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2012,2022].codfw.wmnet,pc1012.eqiad.wmnet with reason: Cloning
[05:28:59] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize pc2022 [puppet] - https://gerrit.wikimedia.org/r/1276474 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[05:37:02] <wikibugs>	 (PS1) Marostegui: pc2012: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276477
[05:37:30] <wikibugs>	 (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1276477 (owner: Marostegui)
[05:37:41] <wikibugs>	 (CR) Marostegui: [C:+2] pc2012: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276477 (owner: Marostegui)
[05:40:52] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2252 [puppet] - https://gerrit.wikimedia.org/r/1276478 (https://phabricator.wikimedia.org/T418979)
[05:41:09] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2143: Cloning db2252 from db2143
[05:41:09] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:41:18] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:41:18] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2143: Cloning db2252 from db2143
[05:41:47] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2143,2252].codfw.wmnet,db1153.eqiad.wmnet with reason: Cloning
[05:42:43] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize db2252 [puppet] - https://gerrit.wikimedia.org/r/1276478 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[05:43:37] <wikibugs>	 (PS1) Marostegui: db2143: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276479
[05:44:13] <wikibugs>	 (CR) Marostegui: [C:+2] db2143: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276479 (owner: Marostegui)
[05:57:30] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:35:00 on gerrit2003.wikimedia.org with reason: Gerrit maintenance
[05:57:51] <logmsgbot>	 !log jelto@cumin1003 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:35:00 on gerrit.discovery.wmnet with reason: Gerrit maintenance
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0600)
[06:00:04] <jouncebot>	 marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0600).
[06:00:04] <jouncebot>	 jelto: Time to do the Gerrit maintenance - T333143 deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0600).
[06:00:05] <stashbot>	 T333143: Move Gerrit data out of root partition - https://phabricator.wikimedia.org/T333143
[06:00:20] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: migrate gerrit2003 data to /srv/gerrit [puppet] - https://gerrit.wikimedia.org/r/1273683 (https://phabricator.wikimedia.org/T333143) (owner: Jelto)
[06:04:59] <jelto>	 !log start gerrit2003 maintenance - T333143
[06:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:46] <jinxer-wm>	 FIRING: [7x] GerritHAProxyBackendUnavailable: Gerrit backend is unavailable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable
[06:10:46] <jinxer-wm>	 RESOLVED: [7x] GerritHAProxyBackendUnavailable: Gerrit backend is unavailable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable
[06:11:42] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1275942 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:12:09] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1275943 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:14:51] <wikibugs>	 (CR) Jelto: [V:+1 C:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8453/co" [puppet] - https://gerrit.wikimedia.org/r/1275942 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[06:24:06] <wikibugs>	 (PS1) Marostegui: db2143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276508 (https://phabricator.wikimedia.org/T424171)
[06:25:04] <wikibugs>	 (CR) Marostegui: [C:+2] db2143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276508 (https://phabricator.wikimedia.org/T424171) (owner: Marostegui)
[06:28:05] <jelto>	 !log gerrit2003 maintenance finished - T333143
[06:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:09] <stashbot>	 T333143: Move Gerrit data out of root partition - https://phabricator.wikimedia.org/T333143
[06:28:15] <wikibugs>	 (PS1) Marostegui: db2252: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1276509 (https://phabricator.wikimedia.org/T418979)
[06:29:33] <wikibugs>	 (PS1) Marostegui: instances.yaml: Replace db2143 with db2252 [puppet] - https://gerrit.wikimedia.org/r/1276510 (https://phabricator.wikimedia.org/T418979)
[06:30:44] <wikibugs>	 (PS7) Jelto: gerrit: migrate gerrit_site away from root partition [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[06:32:35] <wikibugs>	 (CR) Jelto: [V:+1 C:-1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[06:38:24] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:48:50] <wikibugs>	 (CR) Marostegui: [C:+2] db2252: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1276509 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[06:49:22] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Replace db2143 with db2252 [puppet] - https://gerrit.wikimedia.org/r/1276510 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[06:52:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2143 from ms3, add db2252 T418979', diff saved to https://phabricator.wikimedia.org/P91326 and previous config saved to /var/cache/conftool/dbconfig/20260423-065214-marostegui.json
[06:52:19] <stashbot>	 T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979
[06:53:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Make db2252 master of ms3 T418979', diff saved to https://phabricator.wikimedia.org/P91327 and previous config saved to /var/cache/conftool/dbconfig/20260423-065323-marostegui.json
[06:56:27] <wikibugs>	 (PS1) Marostegui: db2251: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276512
[06:56:56] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2252: Cloning
[06:56:56] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:56:57] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[06:56:57] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2252: Cloning
[06:58:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms1 with db2252 as new codfw master T418979', diff saved to https://phabricator.wikimedia.org/P91328 and previous config saved to /var/cache/conftool/dbconfig/20260423-065803-marostegui.json
[06:58:08] <stashbot>	 T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979
[06:58:54] <wikibugs>	 (CR) Marostegui: [C:+2] db2251: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276512 (owner: Marostegui)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:07:43] <wikibugs>	 (CR) Jelto: [C:+2] helmfile.d/miscweb: add values file for aux private secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1275934 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[07:10:25] <wikibugs>	 (Merged) jenkins-bot: helmfile.d/miscweb: add values file for aux private secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1275934 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[07:11:42] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db2145 [puppet] - https://gerrit.wikimedia.org/r/1276514 (https://phabricator.wikimedia.org/T424177)
[07:11:43] <wikibugs>	 (PS3) Jcrespo: mariadb: Set db2250 as a new codfw s1 backup source [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979)
[07:13:28] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[07:13:30] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: Set db2250 as a new codfw s1 backup source [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[07:13:37] <wikibugs>	 (PS1) Muehlenhoff: Add doh5003/5004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1276516 (https://phabricator.wikimedia.org/T421863)
[07:14:02] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Remove db2145 [puppet] - https://gerrit.wikimedia.org/r/1276514 (https://phabricator.wikimedia.org/T424177) (owner: Marostegui)
[07:14:44] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11850023 (MoritzMuehlenhoff) The prometheus5003 VM is ready
[07:14:53] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Set db2250 as a new codfw s1 backup source [puppet] - https://gerrit.wikimedia.org/r/1276382 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[07:15:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2145 from dbctl T424177', diff saved to https://phabricator.wikimedia.org/P91329 and previous config saved to /var/cache/conftool/dbconfig/20260423-071500-marostegui.json
[07:15:05] <stashbot>	 T424177: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177
[07:16:54] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779)
[07:20:08] <wikibugs>	 (PS2) Giuseppe Lavagetto: cache_misc: apply traffic classification [puppet] - https://gerrit.wikimedia.org/r/1276403
[07:22:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye
[07:23:01] <wikibugs>	 SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850030 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2007.codfw.wmnet with OS bullseye
[07:24:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Apply the tcp-proxy role to tcp-proxy5003/5004 [puppet] - https://gerrit.wikimedia.org/r/1275942 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:30:57] <wikibugs>	 (CR) Ayounsi: [C:+1] Add doh5003/5004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1276516 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:32:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:33:34] <wikibugs>	 (PS1) Marostegui: db2145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276520 (https://phabricator.wikimedia.org/T424177)
[07:34:02] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:34:18] <wikibugs>	 (CR) Marostegui: [C:+2] db2145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1276520 (https://phabricator.wikimedia.org/T424177) (owner: Marostegui)
[07:37:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:40:19] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#11850086 (MoritzMuehlenhoff)
[07:41:01] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add doh5003/5004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1276516 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:41:29] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix Cumin alias for kerberized SSH access [puppet] - https://gerrit.wikimedia.org/r/1275883 (owner: Muehlenhoff)
[07:45:49] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
[07:48:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] firewall::service: Add a new parameter unrestricted_access [puppet] - https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[07:50:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
[07:52:47] <wikibugs>	 (PS1) Marostegui: installserver: Remove db2252 [puppet] - https://gerrit.wikimedia.org/r/1276525
[07:54:33] <wikibugs>	 (PS1) Muehlenhoff: http-sso-django-login: Switch to firewall::service and restrict access [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804)
[07:54:52] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Remove db2252 [puppet] - https://gerrit.wikimedia.org/r/1276525 (owner: Marostegui)
[07:55:02] <wikibugs>	 (CR) CI reject: [V:-1] http-sso-django-login: Switch to firewall::service and restrict access [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[07:57:12] <wikibugs>	 (PS1) Marostegui: installserver: Add pc20[21-24] [puppet] - https://gerrit.wikimedia.org/r/1276527 (https://phabricator.wikimedia.org/T418973)
[07:58:02] <wikibugs>	 (PS2) Muehlenhoff: http-sso-django-login: Switch to firewall::service and restrict access [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804)
[07:58:23] <wikibugs>	 (PS2) Marostegui: installserver: Add pc10[21-24] [puppet] - https://gerrit.wikimedia.org/r/1276527 (https://phabricator.wikimedia.org/T418973)
[08:01:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh5003.wikimedia.org
[08:01:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:01:35] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Add pc10[21-24] [puppet] - https://gerrit.wikimedia.org/r/1276527 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[08:04:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5003.wikimedia.org - jmm@cumin2002"
[08:05:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5003.wikimedia.org - jmm@cumin2002"
[08:05:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:05:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh5003.wikimedia.org on all recursors
[08:05:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh5003.wikimedia.org on all recursors
[08:05:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5003.wikimedia.org - jmm@cumin2002"
[08:05:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5003.wikimedia.org - jmm@cumin2002"
[08:06:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh5003.wikimedia.org with OS bookworm
[08:07:02] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host doh5003.wikimedia.org with OS bookworm
[08:07:08] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11850175 (elukey) Update: I tested the new SM firmwares for BIOS and BMC, but the latter seems leading to an inconsistent state: the update doesn't start because of a weird issu...
[08:07:17] <wikibugs>	 (CR) Majavah: http-sso-django-login: Switch to firewall::service and restrict access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[08:09:31] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gerrit: migrate gerrit_site away from root partition [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[08:10:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2007.codfw.wmnet with OS bullseye
[08:10:09] <wikibugs>	 SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850190 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2007.codfw.wmnet with OS bullseye completed...
[08:10:49] <wikibugs>	 (PS1) Majavah: P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356)
[08:10:59] <wikibugs>	 (PS8) Arnaudb: envoyproxy: rebuild envoy.yaml when the placeholder is created [puppet] - https://gerrit.wikimedia.org/r/1275827 (https://phabricator.wikimedia.org/T421827)
[08:12:31] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[08:12:53] <wikibugs>	 (CR) CI reject: [V:-1] P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[08:13:45] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[08:14:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:15] <wikibugs>	 (PS2) Majavah: P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356)
[08:16:03] <wikibugs>	 (CR) Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) (owner: Giuseppe Lavagetto)
[08:17:57] <wikibugs>	 SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850222 (MatthewVernon)
[08:18:08] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge::prometheus: Stop monitoring ingress-nginx [puppet] - https://gerrit.wikimedia.org/r/1276596 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[08:25:44] <wikibugs>	 (CR) Elukey: "True, but lookup() outside profiles have some sense only to lookup very generic variables that are supposed to be everywhere, and/or globa" [puppet] - https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:28:45] <jinxer-wm>	 RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[08:28:53] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, adding Moritz too" [puppet] - https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: Andrew Bogott)
[08:29:01] <wikibugs>	 (PS7) Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993)
[08:29:30] <wikibugs>	 (CR) JavierMonton: [C:+1] EventStreamConfig - add rc0 streams for html and feature count change [mediawiki-config] - https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: Ottomata)
[08:30:46] <wikibugs>	 (CR) Mpostoronca: "Could you tell us how to test this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[08:31:05] <wikibugs>	 (PS8) Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993)
[08:31:35] <wikibugs>	 (CR) Elukey: "All right I think I got it, lemme know if now it makes sense!" [deployment-charts] - https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:32:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11850257 (ayounsi) Those hosts are 7 years old, shouldn't they be fully decom ?  FYI, previous data are visible in https://netbox.wikime...
[08:32:07] <wikibugs>	 (CR) Mpostoronca: "Is there some link to the documentation ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[08:34:18] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:35:37] <wikibugs>	 (CR) Muehlenhoff: "I had been wondering about the same. It's also really unclear how the difference between the "nochange" and the standard repo actually? If" [puppet] - https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: Andrew Bogott)
[08:36:32] <wikibugs>	 (PS2) Muehlenhoff: Add tcp-proxy5003/5004 to conftool [puppet] - https://gerrit.wikimedia.org/r/1275943 (https://phabricator.wikimedia.org/T421863)
[08:36:37] <wikibugs>	 (CR) Elukey: ganeti: Move pki::get_cert into the profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff)
[08:37:36] <wikibugs>	 (PS4) Jcrespo: mariadb: Pool db2250 for backups instead of db2141 [puppet] - https://gerrit.wikimedia.org/r/1276406 (https://phabricator.wikimedia.org/T418979)
[08:38:22] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276406 (https://phabricator.wikimedia.org/T418979) (owner: Jcrespo)
[08:39:41] <wikibugs>	 (CR) Kosta Harlan: [C:+1] hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[08:40:28] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2037.codfw.wmnet with reason: Maintenance
[08:40:29] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] "https://docs.hcaptcha.com/enterprise/secure_enclave#allowpopups-parameter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[08:40:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91330 and previous config saved to /var/cache/conftool/dbconfig/20260423-084035-fceratto.json
[08:42:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add tcp-proxy5003/5004 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1275943 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[08:42:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:47:13] <wikibugs>	 (03CR) 10Muehlenhoff: ganeti: Move pki::get_cert into the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff)
[08:47:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I couldn't find any explicit documentation, though from looking at the packages in nochange my understanding is that they are required dep" [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott)
[08:50:06] <wikibugs>	 (03CR) 10Muehlenhoff: "True, but the encapsulation has already been broken by the cases listed above :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[08:52:07] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs [puppet] - 10https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992)
[08:52:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91333 and previous config saved to /var/cache/conftool/dbconfig/20260423-085207-fceratto.json
[08:52:50] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=tcp-proxy5003.eqsin.wmnet
[08:53:07] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=tcp-proxy5003.eqsin.wmnet
[08:53:17] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db2146 [puppet] - 10https://gerrit.wikimedia.org/r/1276612 (https://phabricator.wikimedia.org/T418979)
[08:55:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5003.wikimedia.org with reason: host reimage
[08:55:47] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2146 [puppet] - 10https://gerrit.wikimedia.org/r/1276612 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui)
[08:56:46] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=tcp-proxy5004.eqsin.wmnet
[08:56:48] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mariadb: Pool db2250 for backups instead of db2141 [puppet] - 10https://gerrit.wikimedia.org/r/1276406 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo)
[08:56:51] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=tcp-proxy5004.eqsin.wmnet
[08:58:11] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=tcp-proxy5001.eqsin.wmnet
[08:58:15] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=tcp-proxy5002.eqsin.wmnet
[08:59:49] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+1] cache::haproxy: support wikilink style usernames in UAs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) (owner: 10Giuseppe Lavagetto)
[09:00:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2146 from dbctl T424179', diff saved to https://phabricator.wikimedia.org/P91334 and previous config saved to /var/cache/conftool/dbconfig/20260423-090014-marostegui.json
[09:00:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5003.wikimedia.org with reason: host reimage
[09:00:19] <stashbot>	 T424179: Add an edit tag when someone edits another user's user CSS - https://phabricator.wikimedia.org/T424179
[09:02:16] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779)
[09:02:16] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037', diff saved to https://phabricator.wikimedia.org/P91335 and previous config saved to /var/cache/conftool/dbconfig/20260423-090216-fceratto.json
[09:06:52] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:07:33] <wikibugs>	 (03CR) 10Elukey: [C:03+1] ganeti: Move pki::get_cert into the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff)
[09:07:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey)
[09:07:52] <wikibugs>	 (03PS3) 10Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749)
[09:08:13] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 (https://phabricator.wikimedia.org/T416384)
[09:09:52] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:09:59] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou)
[09:11:15] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou)
[09:12:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037', diff saved to https://phabricator.wikimedia.org/P91336 and previous config saved to /var/cache/conftool/dbconfig/20260423-091224-fceratto.json
[09:13:15] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276618 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou)
[09:15:39] <wikibugs>	 (03CR) 10Harroyo-wmf: "I've tested this locally by setting `$wgHCaptchaApiUrl` in `LocalSettings.php` like this:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[09:16:32] <wikibugs>	 (03PS2) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812)
[09:17:13] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:19:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh5003.wikimedia.org with OS bookworm
[09:19:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh5003.wikimedia.org
[09:19:13] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850448 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host doh5003.wikimedia.org with OS bookworm completed: - doh5003...
[09:22:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91337 and previous config saved to /var/cache/conftool/dbconfig/20260423-092232-fceratto.json
[09:22:56] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2047.codfw.wmnet with reason: Maintenance
[09:23:04] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2047 (T419961)', diff saved to https://phabricator.wikimedia.org/P91338 and previous config saved to /var/cache/conftool/dbconfig/20260423-092303-fceratto.json
[09:25:19] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: update prod image for outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582)
[09:25:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh5004.wikimedia.org
[09:25:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:25:54] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779)
[09:27:41] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:29:13] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: update prod image for outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos)
[09:30:11] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047 (T419961)', diff saved to https://phabricator.wikimedia.org/P91339 and previous config saved to /var/cache/conftool/dbconfig/20260423-093010-fceratto.json
[09:32:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5004.wikimedia.org - jmm@cumin2002"
[09:34:43] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[09:34:49] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the test-k8s instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275855 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[09:35:12] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] api rate limits: use global apihighlimits-requestor group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler)
[09:35:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Deploy the new Airflow version to the analytics-test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275856 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[09:35:16] <logmsgbot>	 jmm@cumin2002 makevm (PID 3057772) is awaiting input
[09:35:33] <wikibugs>	 (03PS1) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628
[09:37:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi)
[09:38:13] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 (owner: 10Daniel Kinzler)
[09:39:13] <wikibugs>	 (03PS2) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628
[09:40:16] <wikibugs>	 (03PS3) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628
[09:40:20] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047', diff saved to https://phabricator.wikimedia.org/P91340 and previous config saved to /var/cache/conftool/dbconfig/20260423-094019-fceratto.json
[09:40:51] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update prod image for outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos)
[09:42:27] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] redioscope: add more histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276363 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler)
[09:42:53] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update prod image for outlinktopic model (v2 inf protocol) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276626 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos)
[09:49:31] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979)
[09:50:28] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047', diff saved to https://phabricator.wikimedia.org/P91341 and previous config saved to /var/cache/conftool/dbconfig/20260423-095027-fceratto.json
[09:51:39] <wikibugs>	 (03CR) 10Klausman: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman)
[09:52:43] <wikibugs>	 (03CR) 10Kamila Součková: rest-gateway: adjust rate limits (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler)
[09:55:26] <wikibugs>	 (03CR) 10Ayounsi: "Tested on Netbox-next: https://netbox-next.wikimedia.org/extras/scripts/results/304425/" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi)
[09:55:37] <wikibugs>	 (03PS5) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779)
[09:55:42] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: adjust rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler)
[09:55:59] <wikibugs>	 (03PS6) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448)
[09:56:19] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995
[09:56:38] <wikibugs>	 (03PS6) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779)
[09:57:21] <wikibugs>	 (03PS1) 10Marostegui: db2146: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276637 (https://phabricator.wikimedia.org/T424189)
[09:58:27] <wikibugs>	 (03CR) 10Elukey: [C:03+1] ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi)
[09:58:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh5004.wikimedia.org - jmm@cumin2002"
[09:58:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:58:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh5004.wikimedia.org on all recursors
[09:58:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh5004.wikimedia.org on all recursors
[09:59:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5004.wikimedia.org - jmm@cumin2002"
[09:59:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh5004.wikimedia.org - jmm@cumin2002"
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1000)
[10:00:36] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047 (T419961)', diff saved to https://phabricator.wikimedia.org/P91343 and previous config saved to /var/cache/conftool/dbconfig/20260423-100035-fceratto.json
[10:00:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh5004.wikimedia.org with OS bookworm
[10:01:08] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler)
[10:01:08] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host doh5004.wikimedia.org with OS bookworm
[10:01:57] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2146: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276637 (https://phabricator.wikimedia.org/T424189) (owner: 10Marostegui)
[10:07:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler)
[10:07:37] <wikibugs>	 (03PS4) 10Ayounsi: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628
[10:08:04] <wikibugs>	 (03Merged) 10jenkins-bot: api rate limits: use global apihighlimits-requestor group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler)
[10:08:44] <logmsgbot>	 !log daniel@deploy1003 Started scap sync-world: Backport for [[gerrit:1275410|api rate limits: use global apihighlimits-requestor group. (T419796)]]
[10:08:48] <stashbot>	 T419796: API rate limits: define tiers for logged-in (browser) users - https://phabricator.wikimedia.org/T419796
[10:09:49] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi)
[10:10:09] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
[10:10:17] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2008.codfw.wmnet with OS bullseye
[10:10:23] <logmsgbot>	 !log daniel@deploy1003 daniel: Backport for [[gerrit:1275410|api rate limits: use global apihighlimits-requestor group. (T419796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:11:01] <wikibugs>	 (03PS1) 10Marostegui: pc2012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276638 (https://phabricator.wikimedia.org/T424201)
[10:11:39] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] pc2012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276638 (https://phabricator.wikimedia.org/T424201) (owner: 10Marostegui)
[10:11:50] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] redioscope: add more histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276363 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler)
[10:12:31] <logmsgbot>	 !log daniel@deploy1003 daniel: Continuing with deployment
[10:12:47] <wikibugs>	 (03Merged) 10jenkins-bot: ProvisionServerNetworkCSV: various improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1276628 (owner: 10Ayounsi)
[10:13:22] <wikibugs>	 (03CR) 10Muehlenhoff: "Ok, let's simply use openstack-trixie-flamingo and openstack-trixie-gazpacho, then" [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott)
[10:13:42] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[10:13:49] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove pc2012, add pc2022 [puppet] - 10https://gerrit.wikimedia.org/r/1276641 (https://phabricator.wikimedia.org/T424201)
[10:13:50] <wikibugs>	 (03Merged) 10jenkins-bot: redioscope: add more histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276363 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler)
[10:14:13] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[10:14:25] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove pc2012, add pc2022 [puppet] - 10https://gerrit.wikimedia.org/r/1276641 (https://phabricator.wikimedia.org/T424201) (owner: 10Marostegui)
[10:14:45] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply
[10:14:47] <wikibugs>	 (03CR) 10Muehlenhoff: "I meant openstack-trixie-flamingo-backports and openstack-trixie-gazpacho-backports" [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott)
[10:14:47] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[10:15:01] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[10:15:08] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[10:15:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2022, remove pc2012 T418973 T424201', diff saved to https://phabricator.wikimedia.org/P91345 and previous config saved to /var/cache/conftool/dbconfig/20260423-101544-marostegui.json
[10:15:50] <stashbot>	 T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973
[10:15:50] <stashbot>	 T424201: decommission pc2012.codfw.wmnet - https://phabricator.wikimedia.org/T424201
[10:16:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Make pc2022 master of pc2 T418973', diff saved to https://phabricator.wikimedia.org/P91346 and previous config saved to /var/cache/conftool/dbconfig/20260423-101611-marostegui.json
[10:16:21] <logmsgbot>	 !log daniel@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275410|api rate limits: use global apihighlimits-requestor group. (T419796)]] (duration: 07m 37s)
[10:16:25] <stashbot>	 T419796: API rate limits: define tiers for logged-in (browser) users - https://phabricator.wikimedia.org/T419796
[10:16:28] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[10:17:06] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[10:17:18] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350)
[10:17:19] <wikibugs>	 (03PS1) 10Marostegui: pc2022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276642 (https://phabricator.wikimedia.org/T418973)
[10:17:38] <wikibugs>	 (03Abandoned) 10Urbanecm: GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm)
[10:18:24] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] pc2022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1276642 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui)
[10:18:48] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1037.eqiad.wmnet with reason: Maintenance
[10:19:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91347 and previous config saved to /var/cache/conftool/dbconfig/20260423-101855-fceratto.json
[10:19:28] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: apply
[10:19:37] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: apply
[10:19:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc2 with pc2022 as codfw master T418973', diff saved to https://phabricator.wikimedia.org/P91348 and previous config saved to /var/cache/conftool/dbconfig/20260423-101957-marostegui.json
[10:20:13] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:20:39] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "deploying under the assumption that this is an uncontroversial simple fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276402 (https://phabricator.wikimedia.org/T414376) (owner: 10Lucas Werkmeister (WMDE))
[10:20:55] * Lucas_WMDE will deploy ^ in a moment
[10:21:12] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:21:35] <wikibugs>	 (03PS3) 10Muehlenhoff: ganeti: Move pki::get_cert into the profile [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T424204)
[10:22:55] <wikibugs>	 (03PS1) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204)
[10:22:57] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276402 (https://phabricator.wikimedia.org/T414376) (owner: 10Lucas Werkmeister (WMDE))
[10:23:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[10:23:43] <wikibugs>	 (03PS1) 10Marostegui: pc2022: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1276647
[10:23:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[10:24:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[10:24:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[10:24:26] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler)
[10:24:27] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] pc2022: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1276647 (owner: 10Marostegui)
[10:24:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[10:24:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[10:24:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[10:25:13] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949
[10:25:57] <hnowlan>	 jouncebot: nowandnext
[10:25:57] <jouncebot>	 For the next 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1000)
[10:25:57] <jouncebot>	 In 1 hour(s) and 34 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200)
[10:26:39] <hnowlan>	 I will roll out a simple restbase change now-ish
[10:27:11] <logmsgbot>	 !log hnowlan@deploy1003 Started deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change)
[10:27:14] <stashbot>	 T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975
[10:27:20] <wikibugs>	 (03CR) 10Ozge: [C:03+1] ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[10:27:54] <wikibugs>	 (03PS1) 10JavierMonton: alert: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996)
[10:28:16] * Lucas_WMDE done deploying btw
[10:28:27] <wikibugs>	 (03PS2) 10Kevin Bazira: ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350)
[10:28:42] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler)
[10:31:02] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[10:32:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
[10:33:16] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: enable multi-GPU setup using P2P+SHM to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276643 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[10:33:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91351 and previous config saved to /var/cache/conftool/dbconfig/20260423-103334-fceratto.json
[10:33:58] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:37:20] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:38:10] <wikibugs>	 (03PS2) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204)
[10:38:24] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:38:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[10:39:44] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
[10:42:03] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:42:14] <wikibugs>	 (03PS2) 10JavierMonton: alert: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996)
[10:43:01] <icinga-wm>	 PROBLEM - Restbase root url on restbase2033 is CRITICAL: connect to address 10.192.32.174 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[10:43:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037', diff saved to https://phabricator.wikimedia.org/P91352 and previous config saved to /var/cache/conftool/dbconfig/20260423-104343-fceratto.json
[10:45:10] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995
[10:45:10] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 (https://phabricator.wikimedia.org/T413448)
[10:45:37] <wikibugs>	 (03PS7) 10Daniel Kinzler: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779)
[10:45:45] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949
[10:46:37] <wikibugs>	 (03PS3) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204)
[10:47:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[10:48:02] <wikibugs>	 (03CR) 10Kamila Součková: [C:04-1] "need to figure out what's up with the CI diff" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler)
[10:50:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5004.wikimedia.org with reason: host reimage
[10:50:53] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler)
[10:51:43] <wikibugs>	 (03PS4) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204)
[10:52:16] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850797 (10MatthewVernon)
[10:52:30] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler)
[10:53:39] <wikibugs>	 (03PS5) 10Muehlenhoff: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204)
[10:53:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037', diff saved to https://phabricator.wikimedia.org/P91353 and previous config saved to /var/cache/conftool/dbconfig/20260423-105351-fceratto.json
[10:54:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5004.wikimedia.org with reason: host reimage
[10:55:56] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo)
[10:57:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye
[10:58:01] <icinga-wm>	 RECOVERY - Restbase root url on restbase2033 is OK: HTTP OK: HTTP/1.1 200 - 18783 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[10:58:03] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11850798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2008.codfw.wmnet with OS bullseye completed...
[10:59:38] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[11:00:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Make doh5003/doh5004 wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1276656 (https://phabricator.wikimedia.org/T421863)
[11:00:31] <logmsgbot>	 !log hnowlan@deploy1003 Finished deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change) (duration: 33m 20s)
[11:00:40] <stashbot>	 T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975
[11:00:45] <hnowlan>	 jouncebot: nownadnext
[11:00:50] <hnowlan>	 jouncebot: nownandext
[11:00:53] <hnowlan>	 sigh.
[11:01:13] <hnowlan>	 last restbase rollout stalled on a single host, going again
[11:01:16] <logmsgbot>	 !log hnowlan@deploy1003 Started deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change)
[11:01:45] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[11:02:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Add netflow5003 [puppet] - 10https://gerrit.wikimedia.org/r/1276657 (https://phabricator.wikimedia.org/T421863)
[11:02:58] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler)
[11:03:05] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler)
[11:03:11] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler)
[11:03:16] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 (owner: 10Daniel Kinzler)
[11:04:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037 (T419961)', diff saved to https://phabricator.wikimedia.org/P91354 and previous config saved to /var/cache/conftool/dbconfig/20260423-110359-fceratto.json
[11:05:12] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: add suppotr for post requests in limit tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276650 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler)
[11:05:15] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 (owner: 10Daniel Kinzler)
[11:05:20] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: adjust rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276372 (https://phabricator.wikimedia.org/T417779) (owner: 10Daniel Kinzler)
[11:05:46] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 (owner: 10Daniel Kinzler)
[11:08:21] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:08:59] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:11:08] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[11:11:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the test-k8s instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275855 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[11:11:26] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Deploy the new Airflow version to the analytics-test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275856 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[11:12:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh5004.wikimedia.org with OS bookworm
[11:12:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh5004.wikimedia.org
[11:12:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host doh5004.wikimedia.org with OS bookworm completed: - doh5004...
[11:13:05] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:13:12] <logmsgbot>	 !log hnowlan@deploy1003 Finished deploy [restbase/deploy@8a25036]: Add urwikisource T415975 (repeat attempt, last deploy did not include change) (duration: 11m 55s)
[11:13:16] <stashbot>	 T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975
[11:13:26] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:13:46] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[11:13:51] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy the new Airflow version to the test-k8s instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275855 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[11:14:17] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy the new Airflow version to the analytics-test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275856 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis)
[11:16:45] <jinxer-wm>	 RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[11:16:54] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[11:19:25] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:20:19] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:21:01] <moritzm>	 !log installing ngtcp2 security updates
[11:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:26] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2039.codfw.wmnet with reason: Maintenance
[11:21:33] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2039 (T419961)', diff saved to https://phabricator.wikimedia.org/P91355 and previous config saved to /var/cache/conftool/dbconfig/20260423-112133-fceratto.json
[11:29:55] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm but not strictly necessary" [puppet] - 10https://gerrit.wikimedia.org/r/1276657 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[11:31:05] <wikibugs>	 (03CR) 10Kosta Harlan: "> There was, however, an error in this conig: I've put the param as a boolean, but it must be a string. I've updated the patch to fix that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[11:31:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275926 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi)
[11:31:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add netflow5003 [puppet] - 10https://gerrit.wikimedia.org/r/1276657 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[11:33:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039 (T419961)', diff saved to https://phabricator.wikimedia.org/P91356 and previous config saved to /var/cache/conftool/dbconfig/20260423-113307-fceratto.json
[11:34:03] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:34:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for ngtcp2 [puppet] - 10https://gerrit.wikimedia.org/r/1276661
[11:36:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow5003.eqsin.wmnet
[11:36:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:40:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow5003.eqsin.wmnet - jmm@cumin2002"
[11:40:26] <kart_>	 I'll be doing cxserver deployment. staging only.
[11:42:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add library hint for ngtcp2 [puppet] - 10https://gerrit.wikimedia.org/r/1276661 (owner: 10Muehlenhoff)
[11:42:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow5003.eqsin.wmnet - jmm@cumin2002"
[11:42:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:42:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow5003.eqsin.wmnet on all recursors
[11:42:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow5003.eqsin.wmnet on all recursors
[11:42:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow5003.eqsin.wmnet - jmm@cumin2002"
[11:42:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow5003.eqsin.wmnet - jmm@cumin2002"
[11:43:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039', diff saved to https://phabricator.wikimedia.org/P91357 and previous config saved to /var/cache/conftool/dbconfig/20260423-114316-fceratto.json
[11:44:10] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[11:44:55] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[11:44:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, I'm adding o11y folks for heads up and actual votes" [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[11:45:55] <logmsgbot>	 jmm@cumin2002 makevm (PID 3146970) is awaiting input
[11:47:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow5003.eqsin.wmnet with OS bookworm
[11:47:28] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11850962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow5003.eqsin.wmnet with OS bookworm
[11:53:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039', diff saved to https://phabricator.wikimedia.org/P91358 and previous config saved to /var/cache/conftool/dbconfig/20260423-115324-fceratto.json
[11:54:02] <wikibugs>	 (03PS1) 10KartikMistry: cxserver: staging: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276667 (https://phabricator.wikimedia.org/T423002)
[11:57:32] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] cxserver: staging: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276667 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry)
[11:59:18] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:59:27] <wikibugs>	 (03Merged) 10jenkins-bot: cxserver: staging: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276667 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200)
[12:00:20] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:00:45] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:01:42] <wikibugs>	 (03PS3) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812)
[12:02:50] <wikibugs>	 (03CR) 10Harroyo-wmf: "A Google search for `hcaptcha "sentry=true"` suggests that this param should be put as a tring in the URL so probably yes, I'll update thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[12:03:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039 (T419961)', diff saved to https://phabricator.wikimedia.org/P91359 and previous config saved to /var/cache/conftool/dbconfig/20260423-120332-fceratto.json
[12:03:54] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
[12:04:01] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2040 (T419961)', diff saved to https://phabricator.wikimedia.org/P91360 and previous config saved to /var/cache/conftool/dbconfig/20260423-120400-fceratto.json
[12:04:10] <wikibugs>	 (03PS4) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812)
[12:05:10] <wikibugs>	 (03CR) 10Harroyo-wmf: "Patch updated" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[12:05:14] <kostajh>	 jouncebot: nowandnext
[12:05:14] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200)
[12:05:14] <jouncebot>	 In 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1300)
[12:07:54] <kostajh>	 I’m going to sync some patches ahead of the window, unless there are any objections
[12:08:18] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Retry SiteVerify up to two times [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276671 (https://phabricator.wikimedia.org/T421204)
[12:08:44] <kart_>	 !log staging: Update cxserver to 2026-04-23-114216-production (T423002)
[12:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:48] <stashbot>	 T423002: Migrate cxserver in production to node24 - https://phabricator.wikimedia.org/T423002
[12:09:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[12:10:47] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[12:11:03] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1275429|hCaptcha: Don't prevent opening links present in the hCaptcha popup (T408812)]]
[12:11:06] <stashbot>	 T408812: hCaptcha: Clicking links in Accessibility Cookie dialog does nothing - https://phabricator.wikimedia.org/T408812
[12:12:39] <logmsgbot>	 !log kharlan@deploy1003 harroyo-wmf, kharlan: Backport for [[gerrit:1275429|hCaptcha: Don't prevent opening links present in the hCaptcha popup (T408812)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:13:04] <wikibugs>	 (03CR) 10Harroyo-wmf: "for the record: According to hCaptcha Typescript SDK it should indeed be a string:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[12:13:28] <wikibugs>	 (03CR) 10Kosta Harlan: "Thanks. In retrospect, we should have updated the commit message to reflect the 'sentry' change, but that's OK." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf)
[12:14:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040 (T419961)', diff saved to https://phabricator.wikimedia.org/P91361 and previous config saved to /var/cache/conftool/dbconfig/20260423-121439-fceratto.json
[12:15:28] <logmsgbot>	 !log kharlan@deploy1003 harroyo-wmf, kharlan: Continuing with deployment
[12:16:25] <wikibugs>	 (03PS1) 10Ayounsi: eqsin: update netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/1276674 (https://phabricator.wikimedia.org/T421863)
[12:16:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
[12:16:51] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2009.codfw.wmnet with OS bullseye
[12:16:54] <wikibugs>	 (03CR) 10Ayounsi: "To be deployed once netflow5003 is live" [homer/public] - 10https://gerrit.wikimedia.org/r/1276674 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi)
[12:18:22] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Disable Private Access Tokens in secure-api URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216)
[12:19:14] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275429|hCaptcha: Don't prevent opening links present in the hCaptcha popup (T408812)]] (duration: 08m 11s)
[12:19:19] <stashbot>	 T408812: hCaptcha: Clicking links in Accessibility Cookie dialog does nothing - https://phabricator.wikimedia.org/T408812
[12:20:06] <wikibugs>	 (03PS1) 10Muehlenhoff: rsyslog/toil: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204)
[12:20:13] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] remove sandbox1-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1275926 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi)
[12:20:22] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Disable Private Access Tokens in secure-api URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216) (owner: 10Kosta Harlan)
[12:21:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216) (owner: 10Kosta Harlan)
[12:22:52] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Disable Private Access Tokens in secure-api URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276675 (https://phabricator.wikimedia.org/T424216) (owner: 10Kosta Harlan)
[12:23:08] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1276675|hCaptcha: Disable Private Access Tokens in secure-api URL (T424216)]]
[12:23:12] <stashbot>	 T424216: hCaptcha: Set pat=off in hCaptcha secure-api.js URL settings - https://phabricator.wikimedia.org/T424216
[12:24:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040', diff saved to https://phabricator.wikimedia.org/P91362 and previous config saved to /var/cache/conftool/dbconfig/20260423-122448-fceratto.json
[12:24:50] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1276675|hCaptcha: Disable Private Access Tokens in secure-api URL (T424216)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:25:40] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: Add gRPC port to kserve-inference NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276681 (https://phabricator.wikimedia.org/T423582)
[12:26:23] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with deployment
[12:30:06] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276675|hCaptcha: Disable Private Access Tokens in secure-api URL (T424216)]] (duration: 06m 57s)
[12:30:10] <stashbot>	 T424216: hCaptcha: Set pat=off in hCaptcha secure-api.js URL settings - https://phabricator.wikimedia.org/T424216
[12:30:23] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[12:30:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276671 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan)
[12:30:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow5003.eqsin.wmnet with reason: host reimage
[12:31:49] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Retry SiteVerify up to two times [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276671 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan)
[12:32:04] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1276671|hCaptcha: Retry SiteVerify up to two times (T421204)]]
[12:33:38] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1276671|hCaptcha: Retry SiteVerify up to two times (T421204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:34:45] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with deployment
[12:34:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040', diff saved to https://phabricator.wikimedia.org/P91363 and previous config saved to /var/cache/conftool/dbconfig/20260423-123456-fceratto.json
[12:36:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow5003.eqsin.wmnet with reason: host reimage
[12:37:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[12:38:30] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276671|hCaptcha: Retry SiteVerify up to two times (T421204)]] (duration: 06m 25s)
[12:39:19] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
[12:40:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[12:44:03] <Amir1>	 jouncebot: nowandnext
[12:44:03] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1200)
[12:44:03] <jouncebot>	 In 0 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1300)
[12:45:05] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040 (T419961)', diff saved to https://phabricator.wikimedia.org/P91365 and previous config saved to /var/cache/conftool/dbconfig/20260423-124504-fceratto.json
[12:45:22] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
[12:45:27] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2048.codfw.wmnet with reason: Maintenance
[12:45:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2048 (T419961)', diff saved to https://phabricator.wikimedia.org/P91366 and previous config saved to /var/cache/conftool/dbconfig/20260423-124535-fceratto.json
[12:48:12] <kostajh>	 Amir1: I’m done with my deploys
[12:52:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048 (T419961)', diff saved to https://phabricator.wikimedia.org/P91367 and previous config saved to /var/cache/conftool/dbconfig/20260423-125247-fceratto.json
[12:53:15] <wikibugs>	 (CR) Muehlenhoff: [C:+2] ganeti: Move pki::get_cert into the profile [puppet] - https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T424204) (owner: Muehlenhoff)
[12:55:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow5003.eqsin.wmnet with OS bookworm
[12:55:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow5003.eqsin.wmnet
[12:55:31] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851253 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow5003.eqsin.wmnet with OS bookworm completed: - netflo...
[12:59:04] <Amir1>	 kostajh: thanks, but now I need to go to meetings, will do it afterwards. 
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1300).
[13:00:04] <jouncebot>	 aude and HouseOfM: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:00:16] <aude>	 hi
[13:00:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir5003.eqsin.wmnet
[13:00:21] <Lucas_WMDE>	 o/
[13:00:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:00:49] <Lucas_WMDE>	 aude: go ahead with your change, I think :)
[13:01:05] <logmsgbot>	 !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1015.eqiad.wmnet with reason: Decommissioning — T412830
[13:01:07] <aude>	 is HouseOfM here?
[13:01:09] <stashbot>	 T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830
[13:01:30] <wikibugs>	 (PS1) C. Scott Ananian: Parsoid Read Views: 100% rollout to Russian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188)
[13:01:32] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:01:36] <Lucas_WMDE>	 not so far, it looks like
[13:01:48] <aude>	 ok then I will deploy mine
[13:01:56] <Lucas_WMDE>	 I would do that config change separately anyway, feels a bit risky
[13:02:08] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188) (owner: C. Scott Ananian)
[13:02:13] <Lucas_WMDE>	 (though according to jhs’ comment it should probably be fine)
[13:02:28] <cscott>	 o/
[13:02:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:02:45] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[13:02:45] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[13:02:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048', diff saved to https://phabricator.wikimedia.org/P91368 and previous config saved to /var/cache/conftool/dbconfig/20260423-130255-fceratto.json
[13:03:05] <Lucas_WMDE>	 aude: I guess you could deploy cscott’s change together with yours, if you like
[13:03:24] <cscott>	 yeah should be safe
[13:03:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
[13:03:35] <aude>	 ah didn't see
[13:03:39] <wikibugs>	 SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851265 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2009.codfw.wmnet with OS bullseye completed...
[13:04:04] <Lucas_WMDE>	 it just came in ^^
[13:04:10] <cscott>	 aude: no worries, i was late :)
[13:04:41] <aude>	 i can batch them
[13:04:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188) (owner: C. Scott Ananian)
[13:04:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: Aude)
[13:05:02] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:05:34] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:05:47] <wikibugs>	 (PS1) Xcollazo: Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' [mediawiki-config] - https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694)
[13:06:08] <logmsgbot>	 jmm@cumin2002 makevm (PID 3208625) is awaiting input
[13:06:59] <wikibugs>	 (Merged) jenkins-bot: Parsoid Read Views: 100% rollout to Russian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1276697 (https://phabricator.wikimedia.org/T423188) (owner: C. Scott Ananian)
[13:07:02] <wikibugs>	 (CR) CI reject: [V:-1] Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: Aude)
[13:07:32] <aude>	 checking what is wrong
[13:08:12] <Lucas_WMDE>	 aude: looks like T419488 to me :/
[13:08:12] <stashbot>	 T419488: PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488
[13:08:18] <Lucas_WMDE>	 safe to retry IMHO
[13:08:33] <aude>	 ok
[13:08:39] <aude>	 yeah seems unrelated to my change
[13:08:50] <cscott>	 castor-save-workspace-cache failed, yeah, it's been doing that.
[13:08:50] <cscott>	 there
[13:09:00] <cscott>	 there's a retry button on spiderpig that you can just click and it should work
[13:09:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: Aude)
[13:10:41] <wikibugs>	 (Merged) jenkins-bot: Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: Aude)
[13:11:35] <aude>	 they seem merged
[13:11:48] <aude>	 but spiderpig says error
[13:12:16] <aude>	 do i retry to have it continue?
[13:12:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir5003.eqsin.wmnet - jmm@cumin2002"
[13:13:04] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048', diff saved to https://phabricator.wikimedia.org/P91369 and previous config saved to /var/cache/conftool/dbconfig/20260423-131303-fceratto.json
[13:13:48] <wikibugs>	 (PS1) Marostegui: db2252: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276701
[13:14:47] <cscott>	 aude: yes
[13:14:49] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11851291 (Papaul) @ssingh hello just wanted to let you and your team that we have decided to do the switch refresh starting May 4th to May 6th ( 3 days) - Fi...
[13:14:51] <aude>	 ok
[13:15:20] <logmsgbot>	 !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1276697|Parsoid Read Views: 100% rollout to Russian Wikipedia (T423188)]], [[gerrit:1276021|Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis (T420881)]]
[13:15:23] <cscott>	 it should jump back to the "waiting for merge" and should C+2 the stuck patch again.  at least in my experience.
[13:15:26] <stashbot>	 T423188: Parsoid Read Views to deploy ~2026-04-16 - https://phabricator.wikimedia.org/T423188
[13:15:26] <stashbot>	 T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881
[13:15:57] <logmsgbot>	 jmm@cumin2002 makevm (PID 3208625) is awaiting input
[13:16:56] <logmsgbot>	 !log aude@deploy1003 cscott, aude: Backport for [[gerrit:1276697|Parsoid Read Views: 100% rollout to Russian Wikipedia (T423188)]], [[gerrit:1276021|Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis (T420881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:17:05] <aude>	 please check
[13:17:45] <jinxer-wm>	 RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[13:17:45] <jinxer-wm>	 RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[13:17:58] <aude>	 mine looks good
[13:18:08] <cscott>	 yup looks good
[13:18:11] <aude>	 thanks
[13:18:15] <logmsgbot>	 !log aude@deploy1003 cscott, aude: Continuing with deployment
[13:21:09] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[13:21:12] <wikibugs>	 (PS1) JavierMonton: alerts: mw-page-html-content-change-enrich [alerts] - https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996)
[13:21:41] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[13:22:01] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[13:22:02] <logmsgbot>	 !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276697|Parsoid Read Views: 100% rollout to Russian Wikipedia (T423188)]], [[gerrit:1276021|Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis (T420881)]] (duration: 06m 42s)
[13:22:09] <stashbot>	 T423188: Parsoid Read Views to deploy ~2026-04-16 - https://phabricator.wikimedia.org/T423188
[13:22:09] <stashbot>	 T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881
[13:22:17] <aude>	 all done
[13:22:29] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[13:22:33] <Lucas_WMDE>	 thanks for deploying aude!
[13:22:38] <cscott>	 aude: thanks!
[13:22:39] <aude>	 np
[13:22:55] <Lucas_WMDE>	 I’ll try pinging HouseOfM on slack
[13:23:12] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048 (T419961)', diff saved to https://phabricator.wikimedia.org/P91370 and previous config saved to /var/cache/conftool/dbconfig/20260423-132311-fceratto.json
[13:24:45] <HouseOfM>	 o/
[13:25:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir5003.eqsin.wmnet - jmm@cumin2002"
[13:25:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:25:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir5003.eqsin.wmnet on all recursors
[13:25:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir5003.eqsin.wmnet on all recursors
[13:25:23] <Lucas_WMDE>	 hi HouseOfM!
[13:25:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5003.eqsin.wmnet - jmm@cumin2002"
[13:25:51] <Lucas_WMDE>	 you need a deployer, right? or do you have spiderpig access?
[13:25:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5003.eqsin.wmnet - jmm@cumin2002"
[13:26:03] <HouseOfM>	 I do need a deployer, if someone is available
[13:26:13] <Lucas_WMDE>	 sure, I can deploy
[13:26:27] <HouseOfM>	 I would love spiderpig access but alas that isn't available to me right now
[13:26:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: Mhorsey)
[13:26:54] <wikibugs>	 (CR) Marostegui: [C:+2] db2252: Remove note [puppet] - https://gerrit.wikimedia.org/r/1276701 (owner: Marostegui)
[13:27:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts tcp-proxy5001.eqsin.wmnet
[13:27:40] <wikibugs>	 (PS3) Klausman: manifests/hiera: Move ml-serve101[45] to k8s worker role [puppet] - https://gerrit.wikimedia.org/r/1275814
[13:28:15] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851342 (MoritzMuehlenhoff)
[13:28:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5003.eqsin.wmnet with OS bookworm
[13:28:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851344 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir5003.eqsin.wmnet with OS bookworm
[13:30:14] <wikibugs>	 (Merged) jenkins-bot: Enable the CampaignEvents extension on incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: Mhorsey)
[13:30:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266964|Enable the CampaignEvents extension on incubator (T421749)]]
[13:30:35] <stashbot>	 T421749: Deploy CampaignEvents to Wikimedia Incubator - https://phabricator.wikimedia.org/T421749
[13:30:54] <wikibugs>	 (CR) Klausman: [V:+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8455/co" [puppet] - https://gerrit.wikimedia.org/r/1275814 (owner: Klausman)
[13:32:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:32:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Backport for [[gerrit:1266964|Enable the CampaignEvents extension on incubator (T421749)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:32:34] <HouseOfM>	 LGTM
[13:32:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Continuing with deployment
[13:33:08] <Lucas_WMDE>	 oooh, spiderpig looks different
[13:33:16] <Lucas_WMDE>	 the “no” option is now “roll back deployment and terminate”
[13:33:34] <Lucas_WMDE>	 that’s probably T225207 :)
[13:33:34] <stashbot>	 T225207: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207
[13:33:53] <HouseOfM>	 noice
[13:36:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266964|Enable the CampaignEvents extension on incubator (T421749)]] (duration: 06m 11s)
[13:36:46] <stashbot>	 T421749: Deploy CampaignEvents to Wikimedia Incubator - https://phabricator.wikimedia.org/T421749
[13:37:48] <logmsgbot>	 jmm@cumin2002 decommission (PID 3224790) is awaiting input
[13:38:53] <HouseOfM>	 TYSM Lucas_WMDE
[13:39:01] <Lucas_WMDE>	 np :)
[13:39:08] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:39:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:53] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11851429 (jhathaway) >>! In T423282#11844960, @MoritzMuehlenhoff wrote: > Poking at this further I also noticed one other discrepancy actually: For some reason...
[13:50:12] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11851432 (ABran-WMF) yes it should be safe to reboot, you can proceed. Feel free to reach out, HTH
[13:52:06] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db2145 [puppet] - https://gerrit.wikimedia.org/r/1276707 (https://phabricator.wikimedia.org/T424177)
[13:52:54] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2145.codfw.wmnet
[13:53:02] <wikibugs>	 (CR) Marostegui: [C:+2] site.pp: Remove db2145 [puppet] - https://gerrit.wikimedia.org/r/1276707 (https://phabricator.wikimedia.org/T424177) (owner: Marostegui)
[13:56:45] <wikibugs>	 (CR) SBassett: [C:+1] "From a security standpoint, in that this how we want to configure the beta cluster." [puppet] - https://gerrit.wikimedia.org/r/1276017 (https://phabricator.wikimedia.org/T420604) (owner: Ssingh)
[13:59:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:59:32] <wikibugs>	 (PS1) Muehlenhoff: Remove tcp-proxy5001/5002 from conftool [puppet] - https://gerrit.wikimedia.org/r/1276709 (https://phabricator.wikimedia.org/T421863)
[13:59:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:59:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:59:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts tcp-proxy5001.eqsin.wmnet
[13:59:39] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[13:59:52] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851493 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `tcp-proxy5001.eqsin.wmnet` - tcp-proxy5001.eqsin.wmnet (**PA...
[14:00:27] <wikibugs>	 (CR) Ottomata: [C:+1] Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' [mediawiki-config] - https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694) (owner: Xcollazo)
[14:00:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts tcp-proxy5002.eqsin.wmnet
[14:03:50] <logmsgbot>	 jmm@cumin2002 decommission (PID 3247408) is awaiting input
[14:05:17] <logmsgbot>	 marostegui@cumin1003 decommission (PID 275362) is awaiting input
[14:06:21] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2145.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[14:06:26] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2145.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[14:06:26] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:06:27] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2145.codfw.wmnet
[14:07:15] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177#11851506 (Marostegui) a:Marostegui→Jhancock.wm
[14:07:20] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2145.codfw.wmnet - https://phabricator.wikimedia.org/T424177#11851511 (Marostegui) Ready for #dc-ops
[14:10:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:10:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5003.eqsin.wmnet with reason: host reimage
[14:13:11] <wikibugs>	 (CR) Elukey: [C:+1] "LGTM! Remember two things:" [puppet] - https://gerrit.wikimedia.org/r/1275814 (owner: Klausman)
[14:15:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5003.eqsin.wmnet with reason: host reimage
[14:16:13] <logmsgbot>	 jmm@cumin2002 decommission (PID 3247408) is awaiting input
[14:20:02] <wikibugs>	 (CR) Klausman: [V:+1] "Ack! For #3, I'd like to shoulder-surf you deploying, just to see how it's done." [puppet] - https://gerrit.wikimedia.org/r/1275814 (owner: Klausman)
[14:21:41] <wikibugs>	 (PS1) Jelto: miscweb: remove config.private in wmf-navigator release values file [deployment-charts] - https://gerrit.wikimedia.org/r/1276711 (https://phabricator.wikimedia.org/T414405)
[14:22:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:22:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:22:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:22:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts tcp-proxy5002.eqsin.wmnet
[14:22:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851546 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `tcp-proxy5002.eqsin.wmnet` - tcp-proxy5002.eqsin.wmnet (**PA...
[14:23:54] <wikibugs>	 SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851554 (MatthewVernon)
[14:24:58] <wikibugs>	 SRE, SRE-swift-storage, SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11851558 (MatthewVernon) In progress→Resolved All done, the most-filled `/` is now 26% full, which seems healthier.
[14:26:05] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: remove config.private in wmf-navigator release values file [deployment-charts] - https://gerrit.wikimedia.org/r/1276711 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:28:34] <wikibugs>	 (Merged) jenkins-bot: miscweb: remove config.private in wmf-navigator release values file [deployment-charts] - https://gerrit.wikimedia.org/r/1276711 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:30:04] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1430)
[14:33:28] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[14:33:45] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[14:33:54] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[14:34:18] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[14:35:49] <wikibugs>	 (CR) Mpostoronca: [C:+2] "I trust Hector qa" [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[14:36:08] <wikibugs>	 (CR) Brouberol: [C:+1] growthbook: Bump vendored job templ 1.0.1 → 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) (owner: Ryan Kemper)
[14:36:23] <wikibugs>	 (CR) Mpostoronca: "Did qa locally, it passed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: Harroyo-wmf)
[14:38:24] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:39:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5003.eqsin.wmnet with OS bookworm
[14:39:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir5003.eqsin.wmnet
[14:39:43] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851625 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir5003.eqsin.wmnet with OS bookworm completed: - ncredi...
[14:40:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:42:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir5004.eqsin.wmnet
[14:42:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:45:09] <Amir1>	 jouncebot: nowandnext
[14:45:09] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1430)
[14:45:09] <jouncebot>	 In 0 hour(s) and 14 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1500)
[14:46:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir5004.eqsin.wmnet - jmm@cumin2002"
[14:46:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir5004.eqsin.wmnet - jmm@cumin2002"
[14:46:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:46:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir5004.eqsin.wmnet on all recursors
[14:46:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir5004.eqsin.wmnet on all recursors
[14:46:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5004.eqsin.wmnet - jmm@cumin2002"
[14:47:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir5004.eqsin.wmnet - jmm@cumin2002"
[14:48:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5004.eqsin.wmnet with OS bookworm
[14:48:58] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851646 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir5004.eqsin.wmnet with OS bookworm
[14:50:52] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11851651 (ssingh) >>! In T408892#11851291, @Papaul wrote: > @ssingh hello just wanted to let you and your team that we have decided to do the switch refresh...
[14:52:43] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops, Patch-For-Review: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11851668 (Dzahn) Thank you all involved in getting this installed.  Handing over to @Arnoldokoth
[14:54:25] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T410589)', diff saved to https://phabricator.wikimedia.org/P91373 and previous config saved to /var/cache/conftool/dbconfig/20260423-145425-ladsgroup.json
[14:54:30] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[14:55:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:55:22] <wikibugs>	 (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková)
[14:56:19] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] varnish: do not set CSP policy for beta [puppet] - 10https://gerrit.wikimedia.org/r/1276017 (https://phabricator.wikimedia.org/T420604) (owner: 10Ssingh)
[14:57:50] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T424175
[15:00:04] <jouncebot>	 Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1500)
[15:01:12] <Amir1>	 jouncebot: next
[15:01:12] <jouncebot>	 In 0 hour(s) and 58 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1600)
[15:03:30] <moritzm>	 !log installing rsync security updates
[15:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:34] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91374 and previous config saved to /var/cache/conftool/dbconfig/20260423-150433-ladsgroup.json
[15:06:42] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T424175
[15:07:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11851759 (10Jgreen) >>! In T418928#11841638, @Jclark-ctr wrote: > @Jgreen I have not received any updates on mgmt usernames, but I have a feeling we will not be able to use “roo...
[15:07:36] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T424175
[15:11:46] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276719 (https://phabricator.wikimedia.org/T418979)
[15:12:15] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979)
[15:12:27] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979)
[15:12:59] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276719 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo)
[15:13:25] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720
[15:14:28] <wikibugs>	 (03CR) 10Marostegui: "Jaime, feel free to merge whenever you want." [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui)
[15:14:32] <wikibugs>	 (03CR) 10Jcrespo: "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui)
[15:14:38] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui)
[15:14:42] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91375 and previous config saved to /var/cache/conftool/dbconfig/20260423-151441-ladsgroup.json
[15:14:46] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "Minor spelling of Bug: header" [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (owner: 10Marostegui)
[15:15:11] <wikibugs>	 (03PS2) 10Marostegui: installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (https://phabricator.wikimedia.org/T418979)
[15:15:14] <wikibugs>	 (03PS3) 10Jcrespo: installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui)
[15:16:41] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T424175
[15:21:17] <wikibugs>	 (03CR) 10AKhatun: [C:03+1] EventStreamConfig - add rc0 streams for html and feature count change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata)
[15:24:50] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T410589)', diff saved to https://phabricator.wikimedia.org/P91377 and previous config saved to /var/cache/conftool/dbconfig/20260423-152450-ladsgroup.json
[15:24:55] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[15:25:00] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276720 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui)
[15:25:07] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[15:25:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T410589)', diff saved to https://phabricator.wikimedia.org/P91378 and previous config saved to /var/cache/conftool/dbconfig/20260423-152514-ladsgroup.json
[15:25:43] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mariadb: Reenable notifications for db2250 [puppet] - 10https://gerrit.wikimedia.org/r/1276719 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo)
[15:27:56] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[15:30:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5004.eqsin.wmnet with reason: host reimage
[15:32:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:34:03] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:34:23] <sukhe>	 jasmine_: you will need to revert that patch please
[15:34:25] <sukhe>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/59a4a67a9072541cebd4c36cca1a92125b340da1%5E%21/#F0
[15:34:27] <moritzm>	 jasmine_: your change breaks Puppet, see e.g. https://puppetboard.wikimedia.org/report/cirrussearch1110.eqiad.wmnet/7d5df5a5bc002b9dcfacb9301d96a1a68dc576f6
[15:34:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5004.eqsin.wmnet with reason: host reimage
[15:34:53] <sukhe>	 and we do that rollout in steps so once we get to it on Monday, we will do it in that procedure
[15:35:24] <wikibugs>	 (03PS1) 10Jasmine: Revert "service::catalog: add sophroid service catalog entry" [puppet] - 10https://gerrit.wikimedia.org/r/1276723
[15:35:43] <wikibugs>	 (03CR) 10Klausman: [C:03+2] home/klausman: fix c&p error on tmuxp config [puppet] - 10https://gerrit.wikimedia.org/r/1272658 (owner: 10Klausman)
[15:36:21] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] Revert "service::catalog: add sophroid service catalog entry" [puppet] - 10https://gerrit.wikimedia.org/r/1276723 (owner: 10Jasmine)
[15:37:45] <jinxer-wm>	 FIRING: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:38:04] <jasmine_>	 Revert in progress, apologies about that
[15:38:18] <jasmine_>	 reverted
[15:38:57] <sukhe>	 no worries!
[15:41:15] <wikibugs>	 (03PS1) 10Elukey: admin_ng: simplify the deployment of kserve crd resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276726
[15:48:11] <wikibugs>	 (03CR) 10Kamila Součková: "Fixed in I5450ae054cf3b555b228fec72383e58ebc853d5b. Many thanks to @ltoscano@wikimedia.org <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková)
[15:48:35] <sukhe>	 !log sudo cumin -b31 "A:cp and not P{cp2041* or cp2042*}" "run-puppet-agent --enable 'merging CR 1276017'" T420604. finish rollout of removing CSP in VCL from beta
[15:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:40] <stashbot>	 T420604: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604
[15:48:42] <wikibugs>	 (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková)
[15:52:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add Papaul FIDO backup SSH key - https://phabricator.wikimedia.org/T423293#11851990 (10jasmine_) 05Open→03Resolved Resolving, thanks!
[15:54:09] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11851993 (10ssingh) Discussed with @Papaul a bit -- we will depool the site for all three days, just to be on the safe side and since it's ulsfo, one extra day...
[15:54:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5004.eqsin.wmnet with OS bookworm
[15:54:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir5004.eqsin.wmnet
[15:54:52] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11851995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir5004.eqsin.wmnet with OS bookworm completed: - ncredi...
[15:55:59] <sukhe>	 widespread puppet failure in codfw resolving, thanks jasmine_!
[16:00:04] <jouncebot>	 jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:24] <wikibugs>	 (03PS1) 10Krinkle: ext.wikiEditor: Set background-size for toolbar buttons [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805)
[16:00:31] <jasmine_>	 thanks sukhe! appreciate the quick call too moritzm
[16:01:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805) (owner: 10Krinkle)
[16:01:23] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-04-23-122614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276728
[16:02:04] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: simplify the deployment of kserve crd resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276726 (owner: 10Elukey)
[16:02:35] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Really nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[16:03:18] <wikibugs>	 (03CR) 10Elukey: [C:03+1] rsyslog/toil: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[16:03:28] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: simplify the deployment of kserve crd resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276726 (owner: 10Elukey)
[16:06:13] <Amir1>	 jouncebot: nowandnext
[16:06:13] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1600)
[16:06:13] <jouncebot>	 In 0 hour(s) and 53 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700)
[16:06:13] <jouncebot>	 In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700)
[16:07:33] <wikibugs>	 (03PS1) 10Ladsgroup: Media: Fallback to the largest standard size if an overly large one is requested [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745)
[16:07:45] <jinxer-wm>	 FIRING: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:08:03] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Media: Fallback to the largest standard size if an overly large one is requested [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745) (owner: 10Ladsgroup)
[16:10:37] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-logging-codfw cluster: Change Confluent distribution.
[16:11:26] <wikibugs>	 (03CR) 10Herron: [V:03+1 C:03+2] kafka-logging: set all codfw brokers to confluent_distribution 77 [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron)
[16:11:42] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[16:12:05] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[16:12:29] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[16:12:45] <jinxer-wm>	 RESOLVED: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:12:46] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:13:21] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[16:13:57] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:14:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745) (owner: 10Ladsgroup)
[16:16:33] <Amir1>	 !log re-enabling general ban on any non-standard thumb (T414805)
[16:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:38] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[16:20:19] <wikibugs>	 (03Merged) 10jenkins-bot: Media: Fallback to the largest standard size if an overly large one is requested [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276731 (https://phabricator.wikimedia.org/T418745) (owner: 10Ladsgroup)
[16:20:36] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1276731|Media: Fallback to the largest standard size if an overly large one is requested (T418745 T423895)]]
[16:20:43] <stashbot>	 T418745: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745
[16:20:44] <stashbot>	 T423895: Panorama Template on enwiki uses non-common thumbnail sizes (due to defining image height instead of width) - https://phabricator.wikimedia.org/T423895
[16:22:11] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1276731|Media: Fallback to the largest standard size if an overly large one is requested (T418745 T423895)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:22:41] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[16:23:27] <wikibugs>	 (03PS1) 10Jelto: miscweb: add volumeMounts for wmf-navigator secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276737 (https://phabricator.wikimedia.org/T414405)
[16:24:44] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-logging2005 is OK: SSL OK - Certificate kafka-logging2005.codfw.wmnet valid until 2027-03-25 13:20:00 +0000 (expires in 335 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[16:24:57] <wikibugs>	 (03PS2) 10Andrew Bogott: Add upstream repos for openstack flamingo and gazpacho [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598)
[16:24:57] <wikibugs>	 (03PS2) 10Andrew Bogott: Remove openstack::[client|server]packages::flamingo::bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1276010
[16:24:57] <wikibugs>	 (03PS4) 10Andrew Bogott: Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598)
[16:25:26] <wikibugs>	 (03CR) 10Herron: [C:03+1] rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[16:25:50] <wikibugs>	 (03CR) 10Herron: [C:03+1] rsyslog/toil: Move parts of TLS setup into profile::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/1276676 (https://phabricator.wikimedia.org/T424204) (owner: 10Muehlenhoff)
[16:26:03] <wikibugs>	 (03CR) 10Andrew Bogott: Add upstream repos for openstack flamingo and gazpacho (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott)
[16:26:29] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276731|Media: Fallback to the largest standard size if an overly large one is requested (T418745 T423895)]] (duration: 05m 53s)
[16:26:39] <stashbot>	 T418745: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745
[16:26:40] <stashbot>	 T423895: Panorama Template on enwiki uses non-common thumbnail sizes (due to defining image height instead of width) - https://phabricator.wikimedia.org/T423895
[16:28:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11852179 (10bking) @ayounsi or #infrastructure-foundations , are you able to assist @Jclark-ctr with getting the device data restored to N...
[16:29:29] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Change Confluent distribution for Kafka A:kafka-logging-codfw cluster: Change Confluent distribution.
[16:30:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11852181 (10herron)
[16:31:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11852184 (10herron) Cookbook worked well!  `END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Ch...
[16:39:11] <logmsgbot>	 !log jasmine@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: Downtiming to avoid page in case of race condition
[16:42:39] <wikibugs>	 (03PS1) 10Jasmine: Revert "wmnet: remove wikikube-ctrl2005 from SRV records" [dns] - 10https://gerrit.wikimedia.org/r/1276747
[16:43:21] <wikibugs>	 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11852249 (10MMiller_WMF) I am Justin's manager and I approve this.
[16:43:43] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] Revert "wmnet: remove wikikube-ctrl2005 from SRV records" [dns] - 10https://gerrit.wikimedia.org/r/1276747 (owner: 10Jasmine)
[16:44:29] <logmsgbot>	 !log jasmine@dns1004 START - running authdns-update
[16:46:02] <logmsgbot>	 !log jasmine@dns1004 END - running authdns-update
[16:52:45] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Make doh5003/doh5004 wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1276656 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[16:54:33] <jinxer-wm>	 FIRING: [58x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:54:36] <jinxer-wm>	 FIRING: [8x] CertAlmostExpired: Certificate for service doc1004.eqiad.wmnet:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:54:38] <jinxer-wm>	 FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:54:54] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:56:49] <Amir1>	 got a page
[16:56:54] <Amir1>	 !incidents
[16:56:54] <sirenbot>	 7861 (ACKED)  CertAlmostExpired sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad)
[16:57:17] <Raine>	 Amir1: here
[16:57:21] <swfrench-wmf>	 "going to expire in 9d 20h 57m 35s" ?
[16:57:27] <Amir1>	 it's already ACK'ed
[16:57:30] <sukhe>	 a previous page that was ACKed?
[16:57:37] <Raine>	 I just acked it
[16:57:45] <Amir1>	 I'll go ping sre-collab
[16:57:51] <Raine>	 sounds good, thanks Amir1 
[16:58:04] <Amir1>	 oh one thing
[16:58:10] <Amir1>	 > FIRING: [58x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:58:22] <Amir1>	 58 hosts having their certs expiring at the same time?
[16:58:29] <Amir1>	 that is fishy
[16:58:33] <Raine>	 neat
[16:58:47] <swfrench-wmf>	 is that the discovery intermediate?
[16:59:13] <jelto>	 I think that's the discovery certificate 
[16:59:33] <jinxer-wm>	 FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:59:37] <jinxer-wm>	 FIRING: [14x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:59:38] <jinxer-wm>	 FIRING: [34x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:00:04] <jouncebot>	 bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700).
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T1700)
[17:00:36] <Amir1>	 yeah
[17:02:19] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2026-04-23-122614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276728 (owner: 10BryanDavis)
[17:04:27] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2026-04-23-122614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276728 (owner: 10BryanDavis)
[17:04:38] <jinxer-wm>	 FIRING: [34x] CertAlmostExpired: Certificate for service wdqs1018:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:06:26] <Amir1>	 moritzm: elukey: Sorry to ping but we got a page for discovery certs expiring https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired should I just create a ticket for that?
[17:07:30] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:07:45] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:08:09] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:08:27] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:09:03] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:09:18] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:09:38] <jinxer-wm>	 FIRING: [34x] CertAlmostExpired: Certificate for service wdqs1021:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:14:38] <jinxer-wm>	 FIRING: [35x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:15:17] <jynus>	 ah, it is the discovery ones, I was confused thinking it was let's encrypt
[17:16:57] <inflatador>	 does anyone know if the new intermediates are ready for us?
[17:17:31] <inflatador>	 or "use"?..I guess either works ;) . I was looking at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275960 but it's not merged yet
[17:39:44] <sobanski>	 Amir1: there is a ticket for rotating the cert, unless you meant a specific one for silencing the alerts
[17:42:07] <Amir1>	 ah thanks
[17:43:31] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[17:44:38] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:49:38] <jinxer-wm>	 FIRING: [3x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:54:38] <jinxer-wm>	 FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:13:38] <logmsgbot>	 !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Syncing netbox hieradata to fetch BGP for new control planes - jasmine@cumin2002 - T390861"
[18:13:43] <stashbot>	 T390861: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861
[18:16:43] <logmsgbot>	 jasmine@cumin2002 sync-netbox-hiera (PID 3414765) is awaiting input
[18:19:14] <logmsgbot>	 !log jasmine@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Syncing netbox hieradata to fetch BGP for new control planes - jasmine@cumin2002 - T390861"
[18:19:18] <stashbot>	 T390861: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861
[18:24:38] <jinxer-wm>	 FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:31:56] <wikibugs>	 (03PS2) 10Herron: kafka-logging: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1276745 (https://phabricator.wikimedia.org/T423723)
[18:38:24] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:39:38] <jinxer-wm>	 FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:49:38] <jinxer-wm>	 FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:59:38] <jinxer-wm>	 FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:03:40] <ottomata>	 Hi, just FYI i'm going to do a stream config deployment...
[19:03:43] <ottomata>	 seems clear!
[19:04:37] <jinxer-wm>	 FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:04:54] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:05:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694) (owner: Xcollazo)
[19:05:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: Ottomata)
[19:06:05] <wikibugs>	 (Merged) jenkins-bot: Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' [mediawiki-config] - https://gerrit.wikimedia.org/r/1276699 (https://phabricator.wikimedia.org/T417694) (owner: Xcollazo)
[19:06:16] <wikibugs>	 (Merged) jenkins-bot: EventStreamConfig - add rc0 streams for html and feature count change [mediawiki-config] - https://gerrit.wikimedia.org/r/1276397 (https://phabricator.wikimedia.org/T423920) (owner: Ottomata)
[19:06:32] <logmsgbot>	 !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]]
[19:06:43] <stashbot>	 T417694: Perform a one-time clean up of retained data sets in event_sanitize - https://phabricator.wikimedia.org/T417694
[19:06:44] <stashbot>	 T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920
[19:06:52] <jasmine_>	 !log "ran homer on lsw1-c7-codfw and lsw1-b2-codfw following new control planes (T390861)"
[19:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:58] <stashbot>	 T390861: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861
[19:09:37] <logmsgbot>	 !log jasmine@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl[2004-2005].codfw.wmnet
[19:09:39] <logmsgbot>	 !log jasmine@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl[2004-2005].codfw.wmnet
[19:14:07] <logmsgbot>	 !log otto@deploy1003 xcollazo, otto: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:14:14] <stashbot>	 T417694: Perform a one-time clean up of retained data sets in event_sanitize - https://phabricator.wikimedia.org/T417694
[19:14:15] <stashbot>	 T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920
[19:14:38] <jinxer-wm>	 FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:18:07] <ottomata>	 Hi, spiderpig / scap seems to be failing. I am getting:
[19:18:07] <ottomata>	   Error: Failed to get release next in namespace mw-debug: exit status 1: Error: Kubernetes cluster unreachable: Get "https://kubemaster.svc.eqiad.wmnet:6443/version": dial tcp 10.2.2.8:6443: connect: connection refused
[19:18:24] <ottomata>	 I'll ask in slack too...
[19:20:25] <rzl>	 ottomata: curious, have you retried? that endpoint is Working For Me
[19:21:33] <wikibugs>	 (CR) Ottomata: [C:+1] alert: mw-page-html-content-change-enrich (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1276648 (https://phabricator.wikimedia.org/T423996) (owner: JavierMonton)
[19:21:55] <ottomata>	 I did retry yeah
[19:21:57] <ottomata>	 i can try again?
[19:22:06] <ottomata>	 https://spiderpig.wikimedia.org/jobs/1820
[19:23:01] <ottomata>	 oh wait! the retry succeeded!
[19:23:18] <ottomata>	 the logs keep reprinting so I thought it was new output about the failure
[19:23:19] <rzl>	 yeah I was gonna say :) not a spiderpig expert but that looks like it went through
[19:24:08] <rzl>	 so I think you're ready to check on mw-debug and then keep rolling when you're ready
[19:24:50] <logmsgbot>	 !log otto@deploy1003 xcollazo, otto: Continuing with deployment
[19:25:01] <ottomata>	 yup, thank you, sorry for the noise
[19:25:11] <rzl>	 all good! sorry for the hiccup
[19:28:37] <logmsgbot>	 !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]] (duration: 22m 05s)
[19:28:42] <stashbot>	 T417694: Perform a one-time clean up of retained data sets in event_sanitize - https://phabricator.wikimedia.org/T417694
[19:28:43] <stashbot>	 T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920
[19:32:01] <wikibugs>	 (CR) Ottomata: [C:+1] alerts: mw-page-html-content-change-enrich [alerts] - https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: JavierMonton)
[19:32:51] <wikibugs>	 (CR) Brouberol: [C:+1] "Confirmed out of band that kafka3.7 had been deployed to whole cluster" [puppet] - https://gerrit.wikimedia.org/r/1276745 (https://phabricator.wikimedia.org/T423723) (owner: Herron)
[19:34:03] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:34:38] <jinxer-wm>	 FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:44:38] <jinxer-wm>	 FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:52:49] <wikibugs>	 (PS1) TChin: [eventstreams] Bump to v0.19.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1276779 (https://phabricator.wikimedia.org/T420257)
[19:59:38] <jinxer-wm>	 FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1014:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T2000)
[20:00:05] <jouncebot>	 Krinkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:08:06] <rzl>	 ebernhardson: heyo, it looks like you have an mwscript-k8s Metastore.php run from Monday that never got started -- it's wedged in a bad state so I'm just going to delete it, but wanted to check first, do you still need any information off it before I do that?
[20:09:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:09:38] <jinxer-wm>	 FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[20:14:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:14:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:24:19] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11852980 (VRiley-WMF) a:VRiley-WMF
[20:31:13] <cscott>	 is backport window still rolling?  i have one more config patch i'd like to squeeze in
[20:34:58] <wikibugs>	 (PS1) C. Scott Ananian: Deploy Parsoid Read Views to banwiki/ganwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785)
[20:35:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785) (owner: C. Scott Ananian)
[20:35:40] <cscott>	 Krinkle: did you deploy your patch?
[20:43:17] <cscott>	 ok, i'm going to jump in and deploy my config change
[20:44:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785) (owner: C. Scott Ananian)
[20:45:06] <wikibugs>	 (Merged) jenkins-bot: Deploy Parsoid Read Views to banwiki/ganwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1276786 (https://phabricator.wikimedia.org/T423785) (owner: C. Scott Ananian)
[20:45:22] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1276786|Deploy Parsoid Read Views to banwiki/ganwiki (T423785)]]
[20:45:26] <stashbot>	 T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785
[20:47:01] <logmsgbot>	 !log cscott@deploy1003 cscott: Backport for [[gerrit:1276786|Deploy Parsoid Read Views to banwiki/ganwiki (T423785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:47:22] <Krinkle>	 cscott: I did not, sorry. That's fine yeah.
[20:47:39] <logmsgbot>	 !log cscott@deploy1003 cscott: Continuing with deployment
[20:48:55] <Krinkle>	 I'll let mine ride the 10min of CI meanwhile.
[20:48:58] <wikibugs>	 (CR) Krinkle: [C:+2] ext.wikiEditor: Set background-size for toolbar buttons [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805) (owner: Krinkle)
[20:51:25] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276786|Deploy Parsoid Read Views to banwiki/ganwiki (T423785)]] (duration: 06m 02s)
[20:51:31] <stashbot>	 T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785
[20:54:38] <jinxer-wm>	 FIRING: [6x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[20:55:07] <cscott>	 Krinkle: over to you
[20:56:21] <wikibugs>	 (CR) JHathaway: [C:+1] sre.hosts.provision: make UncoreFrequency dynamic for iDRAC 10 [cookbooks] - https://gerrit.wikimedia.org/r/1275889 (https://phabricator.wikimedia.org/T418899) (owner: Elukey)
[20:56:36] <wikibugs>	 (CR) JHathaway: [C:+1] Remove obsolete Hiera file [puppet] - https://gerrit.wikimedia.org/r/1273792 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[20:59:25] <Krinkle>	 thx
[20:59:48] <jinxer-wm>	 FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:00:04] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260423T2100)
[21:00:05] <wikibugs>	 (Merged) jenkins-bot: ext.wikiEditor: Set background-size for toolbar buttons [extensions/WikiEditor] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276727 (https://phabricator.wikimedia.org/T414805) (owner: Krinkle)
[21:00:49] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]]
[21:00:52] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[21:01:18] <wikibugs>	 (CR) JHathaway: [C:+1] Remove puppetmaster::gitpuppet [puppet] - https://gerrit.wikimedia.org/r/1273790 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[21:02:26] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:03:27] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Rolling back deployment
[21:03:49] <Krinkle>	 what?
[21:03:54] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] (duration: 03m 05s)
[21:04:07] <Krinkle>	 Oh, default [n], no mention of "y"
[21:04:11] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]]
[21:04:12] <Krinkle>	 I just pressed enter
[21:04:21] <Krinkle>	 I see it now, a few lines up
[21:04:31] <Krinkle>	 Wee, that's new :) I'll try to remember that next time
[21:05:49] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:05:56] <wikibugs>	 (CR) JHathaway: [C:+2] nf_conntrack_buckets: use default value [puppet] - https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) (owner: JHathaway)
[21:05:58] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[21:06:12] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with deployment
[21:09:38] <jinxer-wm>	 FIRING: [8x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:09:58] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276727|ext.wikiEditor: Set background-size for toolbar buttons (T414805)]] (duration: 05m 47s)
[21:10:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:11:10] <wikibugs>	 (PS1) AKhatun: topic: mw-page-html-feature-counts-change-enrich and -next [puppet] - https://gerrit.wikimedia.org/r/1276794 (https://phabricator.wikimedia.org/T424223)
[21:15:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:20:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:23:44] <rzl>	 jhathaway: I don't immediately see why your patch would break puppet, but the timing lines up, do you see anything? ^
[21:24:06] <jhathaway>	 rzl: thanks, let me look
[21:25:45] <rzl>	 ah yeah I just looked at the wrong couple of hosts with unrelated failures -- now I do see a lot of "Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/base/firewall/nf_conntrack.conf"
[21:26:07] <jhathaway>	 definitely me
[21:26:08] <rzl>	 lmk if you need anything
[21:28:03] <jhathaway>	 hmm, tried on bast4006 and it didn't throw an error on a manual run, hmm, strange
[21:29:34] <rzl>	 maybe a missing-dependency thing where it succeeds on the second run?
[21:30:01] <jhathaway>	 yeah, i'm going to try a second run on the failed hosts...
[21:32:20] <wikibugs>	 (PS1) Bking: cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2 [puppet] - https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860)
[21:32:31] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[21:34:58] <wikibugs>	 (CR) Ryan Kemper: [C:+1] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2 [puppet] - https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[21:35:14] <wikibugs>	 (CR) Bking: [C:+2] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2 [puppet] - https://gerrit.wikimedia.org/r/1276804 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[21:36:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS trixie
[21:39:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards
[21:39:39] <icinga-wm>	 ber_of_pending_tasks: 14, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10146, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:39:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards
[21:39:39] <icinga-wm>	 ber_of_pending_tasks: 14, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10148, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:39:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards
[21:39:39] <icinga-wm>	 ber_of_pending_tasks: 14, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10198, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:39:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 261, delayed_unassigned_shards
[21:39:40] <icinga-wm>	 ber_of_pending_tasks: 9, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10234, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:39:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 293 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1240, relocating_shards: 0, initializing_shards: 32, unassigned_shar
[21:39:41] <icinga-wm>	  delayed_unassigned_shards: 0, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11333, active_shards_percent_as_number: 80.88714938030006 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:40:31] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1364, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 261, delayed_unassigned_shards:
[21:40:31] <icinga-wm>	 er_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 82, active_shards_percent_as_number: 83.57843137254902 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:40:31] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1364, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 261, delayed_unassigned_shards:
[21:40:31] <icinga-wm>	 er_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 82, active_shards_percent_as_number: 83.57843137254902 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:40:31] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 257 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1394, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 249, delayed_unassigned_shard
[21:40:31] <icinga-wm>	 mber_of_pending_tasks: 6, number_of_in_flight_fetch: 5, task_max_waiting_in_queue_millis: 37047, active_shards_percent_as_number: 84.43367655966081 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:40:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 263 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1369, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 257, delayed_unassigned_shards:
[21:40:39] <icinga-wm>	 er_of_pending_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 330, active_shards_percent_as_number: 83.88480392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:40:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 263 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1369, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 256, delayed_unassigned_shards:
[21:40:39] <icinga-wm>	 er_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 155, active_shards_percent_as_number: 83.88480392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:40:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 263 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1369, relocating_shards: 0, initializing_shards: 8, unassigned_shard
[21:40:41] <icinga-wm>	 delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 31, active_shards_percent_as_number: 83.88480392156863 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:40:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:41:30] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1407, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 217, delayed_unassigned_shards: 0, number_of_pending_ta
[21:41:30] <icinga-wm>	 number_of_in_flight_fetch: 5, task_max_waiting_in_queue_millis: 36731, active_shards_percent_as_number: 86.21323529411765 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:41:30] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1407, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 217, delayed_unassigned_shards: 0, number_of_pending_ta
[21:41:30] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 36765, active_shards_percent_as_number: 86.21323529411765 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:41:30] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1451, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 193, delayed_unassigned_shards: 0, number_of_pendin
[21:41:31] <icinga-wm>	  8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 42834, active_shards_percent_as_number: 87.88612961841308 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:41:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1412, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_ta
[21:41:40] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 45718, active_shards_percent_as_number: 86.51960784313727 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:41:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 816, active_shards: 1412, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_ta
[21:41:40] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 45751, active_shards_percent_as_number: 86.51960784313727 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:41:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1413, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 212, delayed_unassign
[21:41:40] <icinga-wm>	 s: 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 46823, active_shards_percent_as_number: 86.58088235294117 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:43:45] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[21:43:52] <jhathaway>	 rzl: it does clear on the second run
[21:43:58] <rzl>	 aha
[21:44:38] <jinxer-wm>	 FIRING: [8x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:45:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_t
[21:45:40] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:45:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_t
[21:45:40] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:45:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_t
[21:45:40] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:45:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_t
[21:45:41] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:45:41] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1309, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 212, delayed_unassig
[21:45:42] <icinga-wm>	 ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38812785388129 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:47:30] <wikibugs>	 (PS1) AKhatun: stream: mediawiki.page_html_feature_counts_change [deployment-charts] - https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223)
[21:48:17] <wikibugs>	 (PS1) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168)
[21:48:18] <wikibugs>	 (PS1) Aaron Schulz: Add wikibase.v1 module to the sandbox were it is present [mediawiki-config] - https://gerrit.wikimedia.org/r/1276814 (https://phabricator.wikimedia.org/T422403)
[21:48:46] <wikibugs>	 (CR) CI reject: [V:-1] beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: BryanDavis)
[21:48:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage
[21:49:13] <wikibugs>	 (CR) CI reject: [V:-1] stream: mediawiki.page_html_feature_counts_change [deployment-charts] - https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: AKhatun)
[21:52:25] <wikibugs>	 (PS2) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168)
[21:54:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage
[21:58:27] <wikibugs>	 (PS2) AKhatun: stream: mediawiki.page_html_feature_counts_change [deployment-charts] - https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223)
[21:59:24] <wikibugs>	 (CR) AKhatun: stream: mediawiki.page_html_feature_counts_change (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1276812 (https://phabricator.wikimedia.org/T424223) (owner: AKhatun)
[22:03:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[22:04:32] <wikibugs>	 (PS1) Bking: cloudelastic: set role-level hiera for OpenSearch 2/Trixie [puppet] - https://gerrit.wikimedia.org/r/1276818 (https://phabricator.wikimedia.org/T422860)
[22:04:56] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276818 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[22:07:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2013 to codfw - jhancock@cumin2002"
[22:08:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2013 to codfw - jhancock@cumin2002"
[22:08:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:12:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host rdb2013
[22:12:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2013
[22:12:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host rdb2014
[22:13:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2014
[22:13:48] <wikibugs>	 (PS1) Ladsgroup: QuickView: Fix relying on non-standard sizes [extensions/MediaSearch] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276819 (https://phabricator.wikimedia.org/T424032)
[22:14:09] <Amir1>	 jouncebot: nowandnext
[22:14:09] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 45 minute(s)
[22:14:09] <jouncebot>	 In 7 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260424T0600)
[22:14:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:14:16] <Amir1>	 noice noice
[22:14:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:14:38] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:14:38] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:14:40] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:14:40] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:15:38] <icinga-wm>	 PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:15:40] <icinga-wm>	 PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/MediaSearch] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276819 (https://phabricator.wikimedia.org/T424032) (owner: Ladsgroup)
[22:18:41] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:21:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:21:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1011.eqiad.wmnet with OS trixie
[22:21:41] <wikibugs>	 (PS1) Andrew Bogott: setup_capi.sh.erb: update to resemble upstream guides for magnum-capi [puppet] - https://gerrit.wikimedia.org/r/1276820
[22:21:42] <wikibugs>	 (PS1) Andrew Bogott: Magnum: switch codfw1dev from capi-helm to magnum-cluster-api driver [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782)
[22:22:30] <wikibugs>	 (CR) CI reject: [V:-1] Magnum: switch codfw1dev from capi-helm to magnum-cluster-api driver [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[22:24:20] <wikibugs>	 (Merged) jenkins-bot: QuickView: Fix relying on non-standard sizes [extensions/MediaSearch] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1276819 (https://phabricator.wikimedia.org/T424032) (owner: Ladsgroup)
[22:24:37] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1276819|QuickView: Fix relying on non-standard sizes (T424032)]]
[22:24:41] <stashbot>	 T424032: MediaSearch results does not use the standard thumbnail sizes - https://phabricator.wikimedia.org/T424032
[22:26:10] <wikibugs>	 (PS2) Andrew Bogott: Magnum: switch codfw1dev from capi-helm to magnum-cluster-api driver [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782)
[22:26:14] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1276819|QuickView: Fix relying on non-standard sizes (T424032)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:26:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:27:06] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:27:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:27:49] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1276821 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[22:28:07] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:28:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:28:56] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:31:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[22:31:55] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276819|QuickView: Fix relying on non-standard sizes (T424032)]] (duration: 07m 19s)
[22:31:59] <stashbot>	 T424032: MediaSearch results does not use the standard thumbnail sizes - https://phabricator.wikimedia.org/T424032
[22:34:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:37:58] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:38:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:38:24] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:38:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[22:45:01] <logmsgbot>	 jhancock@cumin2002 provision (PID 3589026) is awaiting input
[22:47:05] <wikibugs>	 (PS6) Cwhite: rsyslog: Move parts of TLS setup into profile::syslog::centralserver [puppet] - https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: Muehlenhoff)
[22:48:09] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:48:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[22:49:38] <jinxer-wm>	 FIRING: [6x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:51:37] <wikibugs>	 (PS3) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168)
[22:52:54] <wikibugs>	 (CR) Scott French: [C:+1] "Nice find!" [puppet] - https://gerrit.wikimedia.org/r/1273926 (owner: CDanis)
[22:54:38] <jinxer-wm>	 FIRING: [6x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:58:31] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[23:00:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:00:57] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:04:52] <jinxer-wm>	 FIRING: [15x] CertAlmostExpired: Certificate for service contint1002:1443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:05:09] <wikibugs>	 (CR) Cwhite: [C:+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1276645/8458/" [puppet] - https://gerrit.wikimedia.org/r/1276645 (https://phabricator.wikimedia.org/T424204) (owner: Muehlenhoff)
[23:05:09] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:26:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:28:43] <wikibugs>	 (CR) Scott French: "Thanks, Chris!" [puppet] - https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[23:31:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[23:31:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[23:34:03] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:38:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb2013.codfw.wmnet with OS trixie
[23:38:37] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11853494 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb2013.codfw.wmnet with OS trixie
[23:38:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb2014.codfw.wmnet with OS trixie
[23:38:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11853495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb2014.codfw.wmnet with OS trixie
[23:40:00] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1276828
[23:40:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1276828 (owner: TrainBranchBot)
[23:50:28] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1276828 (owner: TrainBranchBot)
[23:54:38] <jinxer-wm>	 FIRING: [4x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:55:34] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3624979) is awaiting input
[23:59:38] <jinxer-wm>	 FIRING: [4x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire  - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired