[00:24:36] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[00:24:40] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[00:33:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[00:34:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:45:45] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:45:45] <jinxer-wm>	 Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[00:45:45] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:09:17] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:35] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11837345 (jhathaway) @MoritzMuehlenhoff I tried to reproduce the issue on Friday afternoon, but I was unable to trigger it with simulated loads via cumin. I rat...
[02:34:17] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:23] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:44:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:48:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:05:52] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[03:05:57] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[03:14:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:19:37] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[03:19:42] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[03:24:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:30:33] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:39:29] <wikibugs>	 SRE, Traffic: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11837399 (Naruse_shiroha) Any update on this after one month...?
[04:10:33] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:56:48] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11837462 (Marostegui) Amazing thank you!
[05:08:28] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11837463 (Marostegui) Resolved→Open @Jhancock.wm unfortunately pc2022, pc2023 and pc2024 have the wrong RAID. They should have RAID10 but they have RAID 0 pc2021 is corre...
[05:09:47] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11837466 (Marostegui) @VRiley-WMF please note these hosts require RAID 10 (just saying cause there was some config confusion in codfw and they ended with RAID 0 instead).
[05:17:26] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:25:57] <wikibugs>	 (PS1) Marostegui: db2151: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1275246
[05:27:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 34.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:31:56] <wikibugs>	 (CR) Marostegui: [C:+2] db2151: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1275246 (owner: Marostegui)
[05:32:23] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2151.codfw.wmnet with reason: Reimage to Trixie
[05:32:28] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2151: Reimage to Trixie
[05:32:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2151: Reimage to Trixie
[05:33:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2151.codfw.wmnet with OS trixie
[05:47:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:53:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: host reimage
[05:59:52] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: host reimage
[06:06:20] <marostegui>	 !log Removed categorylinks_icu72 from s1 and s6 T422546
[06:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:24] <stashbot>	 T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546
[06:10:06] <wikibugs>	 (PS1) Marostegui: Revert "db2151: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1275248
[06:11:44] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2151: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1275248 (owner: Marostegui)
[06:22:26] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2151.codfw.wmnet with OS trixie
[06:22:57] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2151: after reimage to trixie
[06:26:02] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove bast1003 from list of bastions [puppet] - https://gerrit.wikimedia.org/r/1273413 (owner: Muehlenhoff)
[06:26:53] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2151: after reimage to trixie
[06:26:59] <wikibugs>	 (CR) Arnaudb: gerrit: update sync-instances cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: Arnaudb)
[06:27:12] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2151: repool after maintenance
[06:30:56] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1275249 (https://phabricator.wikimedia.org/T423837)
[06:35:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2214 with weight 0 T423837', diff saved to https://phabricator.wikimedia.org/P91127 and previous config saved to /var/cache/conftool/dbconfig/20260420-063553-marostegui.json
[06:35:59] <stashbot>	 T423837: Switchover s6 master (db2229 -> db2214) - https://phabricator.wikimedia.org/T423837
[06:36:02] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 21 hosts with reason: Primary switchover s6 T423837
[06:36:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:36:44] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1275249 (https://phabricator.wikimedia.org/T423837) (owner: Gerrit maintenance bot)
[06:39:43] <marostegui>	 !log Starting s6 codfw failover from db2229 to db2214 - T423837
[06:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:07] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2214 to s6 primary T423837', diff saved to https://phabricator.wikimedia.org/P91128 and previous config saved to /var/cache/conftool/dbconfig/20260420-064006-marostegui.json
[06:40:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2229 T423837', diff saved to https://phabricator.wikimedia.org/P91129 and previous config saved to /var/cache/conftool/dbconfig/20260420-064042-marostegui.json
[06:40:46] <wikibugs>	 (PS1) Muehlenhoff: Remove bast5004 from list of active bastions [puppet] - https://gerrit.wikimedia.org/r/1275250 (https://phabricator.wikimedia.org/T421863)
[06:41:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:43:06] <icinga-wm>	 PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[06:43:23] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:45:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:48:26] <wikibugs>	 (PS1) Marostegui: db2229: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1275252
[06:49:37] <icinga-wm>	 ACKNOWLEDGEMENT - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: Marostegui Host will be decommed https://wikitech.wikimedia.org/wiki/Orchestrator
[06:49:40] <wikibugs>	 (CR) Marostegui: [C:+2] db2229: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1275252 (owner: Marostegui)
[06:50:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:50:24] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2229.codfw.wmnet with reason: Reimage to Trixie
[06:50:30] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2229: Reimage to Trixie
[06:50:36] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM, re: pontoon changes, they are always safe wrt breaking production (i.e. they can't)" [puppet] - https://gerrit.wikimedia.org/r/1273833 (owner: Andrew Bogott)
[06:50:38] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2229: Reimage to Trixie
[06:51:06] <wikibugs>	 (PS1) Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804)
[06:52:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2229.codfw.wmnet with OS trixie
[06:52:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:52:45] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#11837619 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[06:57:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:57:31] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2151: repool after maintenance
[06:59:09] <wikibugs>	 (PS1) Marostegui: Revert "db2229: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1275254
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:07:00] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[07:07:22] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:07:29] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T419635)', diff saved to https://phabricator.wikimedia.org/P91132 and previous config saved to /var/cache/conftool/dbconfig/20260420-070728-fceratto.json
[07:07:42] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[07:09:41] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T419635)', diff saved to https://phabricator.wikimedia.org/P91133 and previous config saved to /var/cache/conftool/dbconfig/20260420-070941-fceratto.json
[07:10:31] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2229.codfw.wmnet with reason: host reimage
[07:12:09] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores, Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11837649 (elukey) @herron I would change a thing - I think it is sufficient to upgrade a single host (like https://gerrit....
[07:14:56] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: host reimage
[07:15:27] <wikibugs>	 (CR) Slyngshede: [C:+2] data: align config [puppet] - https://gerrit.wikimedia.org/r/1273658 (owner: Slyngshede)
[07:16:38] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Switch Cloud VPS to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/1273441 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[07:19:50] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P91134 and previous config saved to /var/cache/conftool/dbconfig/20260420-071949-fceratto.json
[07:20:35] <wikibugs>	 (CR) Ayounsi: [C:+1] Remove bast5004 from list of active bastions [puppet] - https://gerrit.wikimedia.org/r/1275250 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:26:24] <wikibugs>	 (CR) Ayounsi: "Idea lgtm, can you just run PCC on a random host to make sure it's a real NOOP ?" [puppet] - https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[07:27:17] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2229: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1275254 (owner: Marostegui)
[07:29:58] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P91135 and previous config saved to /var/cache/conftool/dbconfig/20260420-072957-fceratto.json
[07:30:01] <marostegui>	 !log Removed categorylinks_icu72 from s12 T422546
[07:30:03] <marostegui>	 !log Removed categorylinks_icu72 from s2 T422546
[07:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:04] <stashbot>	 T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546
[07:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:15] <marostegui>	 !log Removed categorylinks_icu72 from s7 T422546
[07:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:45] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove bast5004 from list of active bastions [puppet] - https://gerrit.wikimedia.org/r/1275250 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:38:14] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2229.codfw.wmnet with OS trixie
[07:40:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T419635)', diff saved to https://phabricator.wikimedia.org/P91136 and previous config saved to /var/cache/conftool/dbconfig/20260420-074005-fceratto.json
[07:40:10] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[07:40:24] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[07:40:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T419635)', diff saved to https://phabricator.wikimedia.org/P91137 and previous config saved to /var/cache/conftool/dbconfig/20260420-074031-fceratto.json
[07:41:05] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2229: after reimage to trixie
[07:44:45] <wikibugs>	 (PS1) Muehlenhoff: Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673)
[07:47:14] <wikibugs>	 (PS1) Muehlenhoff: Remove bast5004 [puppet] - https://gerrit.wikimedia.org/r/1275258 (https://phabricator.wikimedia.org/T421863)
[07:47:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:48:27] <wikibugs>	 (CR) Ayounsi: [C:+1] "Overall lgtm, thanks!" [alerts] - https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: Blake)
[07:49:20] <wikibugs>	 (CR) Ayounsi: [C:+1] Remove bast5004 [puppet] - https://gerrit.wikimedia.org/r/1275258 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:51:08] <marostegui>	 !log Removed categorylinks_icu72 from s5 T422546
[07:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:12] <stashbot>	 T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546
[07:52:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:55:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T419635)', diff saved to https://phabricator.wikimedia.org/P91139 and previous config saved to /var/cache/conftool/dbconfig/20260420-075524-fceratto.json
[07:55:29] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[07:57:34] <wikibugs>	 (CR) Klausman: [C:+1] istio: revisit Prometheus buckets for Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[07:59:02] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 12389
[07:59:56] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12389
[08:01:00] <marostegui>	 !log Removed categorylinks_icu72 from s3 with a sleep, this will take around 1.5 hours T422546
[08:01:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:06] <stashbot>	 T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546
[08:02:19] <wikibugs>	 (PS1) MVernon: apus: move eqiad controller moss-be1001 -> apus-be1005 [puppet] - https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901)
[08:04:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5004.wikimedia.org
[08:05:22] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:05:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T419961)', diff saved to https://phabricator.wikimedia.org/P91141 and previous config saved to /var/cache/conftool/dbconfig/20260420-080529-fceratto.json
[08:05:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P91142 and previous config saved to /var/cache/conftool/dbconfig/20260420-080539-fceratto.json
[08:06:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:07:05] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM cloudcumin2001.codfw.wmnet
[08:07:05] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11837790 (MoritzMuehlenhoff)
[08:07:27] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837791 (ops-monitoring-bot) VM cloudcumin2001.codfw.wmnet rebooted by filippo@cumin1003 with reason: None
[08:09:17] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:09:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:11:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:13:06] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cloudcumin2001.codfw.wmnet
[08:14:02] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2188.codfw.wmnet
[08:14:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419961)', diff saved to https://phabricator.wikimedia.org/P91144 and previous config saved to /var/cache/conftool/dbconfig/20260420-081416-fceratto.json
[08:14:37] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2188.codfw.wmnet
[08:14:46] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11837828 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker2188.codfw.wmnet completed: - wikikube-worker21...
[08:15:16] <logmsgbot>	 jmm@cumin2002 decommission (PID 198689) is awaiting input
[08:15:19] <logmsgbot>	 !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on wikikube-worker2188.codfw.wmnet with reason: dcops intervention
[08:15:24] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11837829 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=763d93ce-3c2a-432a-9965-5b1307189ea7) set by cgoubert@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their...
[08:15:35] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM cloudcumin1001.eqiad.wmnet
[08:15:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11837832 (Clement_Goubert) Depooled and downtimed for 30 days, all yours.
[08:15:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P91145 and previous config saved to /var/cache/conftool/dbconfig/20260420-081547-fceratto.json
[08:15:56] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837833 (ops-monitoring-bot) VM cloudcumin1001.eqiad.wmnet rebooted by filippo@cumin1003 with reason: None
[08:17:16] <wikibugs>	 (PS1) Ayounsi: Comment out eqsin Atlas Anchor [puppet] - https://gerrit.wikimedia.org/r/1275261 (https://phabricator.wikimedia.org/T421863)
[08:17:43] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837840 (fgiunchedi)
[08:19:31] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cloudcumin1001.eqiad.wmnet
[08:22:08] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1275261 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[08:22:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:22:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:22:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:22:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5004.wikimedia.org
[08:23:10] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11837847 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast5004.wikimedia.org` - bast5004.wikimedia.org (**PASS**)...
[08:23:57] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb1015.eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: Reimage to Trixie
[08:24:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P91146 and previous config saved to /var/cache/conftool/dbconfig/20260420-082424-fceratto.json
[08:24:26] <wikibugs>	 (03PS1) 10Marostegui: db1165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275262
[08:25:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T419635)', diff saved to https://phabricator.wikimedia.org/P91147 and previous config saved to /var/cache/conftool/dbconfig/20260420-082555-fceratto.json
[08:26:00] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:26:13] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:26:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2004.codfw.wmnet
[08:26:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2229: after reimage to trixie
[08:26:35] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Comment out eqsin Atlas Anchor [puppet] - 10https://gerrit.wikimedia.org/r/1275261 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi)
[08:26:45] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275262 (owner: 10Marostegui)
[08:27:06] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1165.eqiad.wmnet with reason: Reimage to Trixie
[08:27:11] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1165: Reimage to Trixie
[08:27:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1165: Reimage to Trixie
[08:28:34] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1165.eqiad.wmnet with OS trixie
[08:30:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:30:41] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts atlas5001.wikimedia.org
[08:30:48] <wikibugs>	 (03CR) 10Federico Ceratto: "The change is also updating the regex "node /^apus-be100[46789]\.eqiad\./ {" making it more selective, is it intended?" [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon)
[08:31:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove bast5004 [puppet] - 10https://gerrit.wikimedia.org/r/1275258 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[08:32:16] <wikibugs>	 (03PS1) 10Marostegui: eqiad.yaml: Add clouddb1024 [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557)
[08:32:26] <wikibugs>	 (03CR) 10MVernon: "Yes - it removes apus-be1005 from it (otherwise it would be matched twice)." [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon)
[08:32:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:32:48] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2004.codfw.wmnet
[08:32:55] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837877 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2004.codfw.wmnet` - testvm2004.codfw.wmnet (...
[08:33:28] <wikibugs>	 (03CR) 10Marostegui: "s4 should be ready to start getting in the LB. s6 would be ready tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[08:34:17] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:34:33] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P91150 and previous config saved to /var/cache/conftool/dbconfig/20260420-083432-fceratto.json
[08:34:47] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:36:27] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673) (owner: 10Muehlenhoff)
[08:37:50] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837939 (10MoritzMuehlenhoff)
[08:38:08] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:39:00] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:39:17] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:39:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2005.codfw.wmnet
[08:39:49] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[08:39:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T419635)', diff saved to https://phabricator.wikimedia.org/P91151 and previous config saved to /var/cache/conftool/dbconfig/20260420-083957-fceratto.json
[08:40:01] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:40:03] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837945 (10ops-monitoring-bot) VM testvm2005.codfw.wmnet rebooted by jmm@cumin2002 with reason: None
[08:40:29] <wikibugs>	 (03PS1) 10Elukey: profile::cumin: update insetup_role_report.py [puppet] - 10https://gerrit.wikimedia.org/r/1275345
[08:41:05] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003"
[08:41:44] <wikibugs>	 (03CR) 10MVernon: "(to address the lack of apus100[1,2] in the regex - we never used those hostnames (nor will we), because they were called moss-be100[1,2])" [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon)
[08:41:48] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003"
[08:41:48] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:41:49] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts atlas5001.wikimedia.org
[08:41:55] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11837951 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1003 for hosts: `atlas5001.wikimedia.org` - atlas5001.wikimedia.org (**WARN**)   - //Host not f...
[08:41:58] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1165.eqiad.wmnet with reason: host reimage
[08:42:01] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275348
[08:42:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419635)', diff saved to https://phabricator.wikimedia.org/P91152 and previous config saved to /var/cache/conftool/dbconfig/20260420-084209-fceratto.json
[08:42:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#11837956 (10MoritzMuehlenhoff)
[08:43:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2005.codfw.wmnet
[08:44:17] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:44:41] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419961)', diff saved to https://phabricator.wikimedia.org/P91153 and previous config saved to /var/cache/conftool/dbconfig/20260420-084440-fceratto.json
[08:45:04] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[08:45:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91154 and previous config saved to /var/cache/conftool/dbconfig/20260420-084512-fceratto.json
[08:47:49] <wikibugs>	 (03CR) 10Blake: "Ah, I meant to rewrite those after adding the runbook, thanks for the catch. I've updated them to instead state impact, rather than sugges" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[08:48:07] <wikibugs>	 (03PS16) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877)
[08:48:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet
[08:48:37] <wikibugs>	 (03CR) 10FNegri: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[08:49:17] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11837972 (10AnnieKim_WMDE) Uploaded my ssh public key, waiting to be added to groups.
[08:49:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1165.eqiad.wmnet with reason: host reimage
[08:50:56] <wikibugs>	 (03CR) 10Marostegui: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[08:51:57] <wikibugs>	 (03PS2) 10Muehlenhoff: Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673)
[08:52:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91155 and previous config saved to /var/cache/conftool/dbconfig/20260420-085217-fceratto.json
[08:53:08] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:53:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91156 and previous config saved to /var/cache/conftool/dbconfig/20260420-085349-fceratto.json
[08:56:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673) (owner: 10Muehlenhoff)
[08:59:12] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] maintain-views: Hide blocks with bl_deleted set to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1273781 (https://phabricator.wikimedia.org/T414188) (owner: 10Dreamy Jazz)
[08:59:36] <wikibugs>	 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838027 (10DPogorzelski-WMF) 05Open→03Resolved
[09:01:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275348 (owner: 10Marostegui)
[09:01:35] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "perfecto, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[09:01:45] <wikibugs>	 06SRE, 10Lift-Wing, 06Machine-Learning-Team: Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838029 (10DPogorzelski-WMF)
[09:02:08] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11838031 (10ayounsi)
[09:02:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91157 and previous config saved to /var/cache/conftool/dbconfig/20260420-090225-fceratto.json
[09:04:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P91158 and previous config saved to /var/cache/conftool/dbconfig/20260420-090401-fceratto.json
[09:07:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2007.codfw.wmnet
[09:07:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:08:26] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838039 (10ops-monitoring-bot) VM testvm2007.codfw.wmnet rebooted by jmm@cumin2002 with reason: None
[09:10:57] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:11:01] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:11:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2007.codfw.wmnet
[09:12:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419635)', diff saved to https://phabricator.wikimedia.org/P91159 and previous config saved to /var/cache/conftool/dbconfig/20260420-091233-fceratto.json
[09:12:37] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:13:02] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:13:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T419635)', diff saved to https://phabricator.wikimedia.org/P91160 and previous config saved to /var/cache/conftool/dbconfig/20260420-091310-fceratto.json
[09:13:19] <wikibugs>	 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838048 (10isarantopoulos)
[09:13:46] <logmsgbot>	 jmm@cumin2002 decommission (PID 226183) is awaiting input
[09:13:49] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:13:53] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:14:08] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1165.eqiad.wmnet with OS trixie
[09:14:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P91161 and previous config saved to /var/cache/conftool/dbconfig/20260420-091409-fceratto.json
[09:15:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419635)', diff saved to https://phabricator.wikimedia.org/P91162 and previous config saved to /var/cache/conftool/dbconfig/20260420-091522-fceratto.json
[09:16:39] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1165: after reimage to trixie
[09:18:09] <wikibugs>	 (03CR) 10Tiziano Fogli: istio: revisit Prometheus buckets for Wikikube (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey)
[09:18:24] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:19:13] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:21:15] <Raine>	 !nowandnext
[09:21:24] <Raine>	 jouncebot: nowandnext
[09:21:25] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 38 minute(s)
[09:21:25] <jouncebot>	 In 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1000)
[09:21:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2008.wikimedia.org
[09:21:37] <Raine>	 this is getting embarrassing xD
[09:21:56] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[09:21:57] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838081 (10ops-monitoring-bot) VM testvm2008.wikimedia.org rebooted by jmm@cumin2002 with reason: None
[09:22:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:22:25] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "Great, thanks! I'll add this to the K8s SIG agenda so the other cluster maintainers can decide whether they would like to route their aler" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[09:23:08] <wikibugs>	 (03CR) 10Blake: [C:03+2] kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[09:24:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:24:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:24:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2006.codfw.wmnet
[09:24:17] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838086 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (...
[09:24:18] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91164 and previous config saved to /var/cache/conftool/dbconfig/20260420-092417-fceratto.json
[09:24:40] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[09:24:49] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T419961)', diff saved to https://phabricator.wikimedia.org/P91165 and previous config saved to /var/cache/conftool/dbconfig/20260420-092448-fceratto.json
[09:24:52] <wikibugs>	 (03Merged) 10jenkins-bot: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[09:25:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet
[09:25:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2008.wikimedia.org
[09:25:31] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91166 and previous config saved to /var/cache/conftool/dbconfig/20260420-092530-fceratto.json
[09:25:34] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11838091 (10ops-monitoring-bot) Draining ganeti5006.eqsin.wmnet of running VMs
[09:26:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet
[09:26:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet
[09:26:54] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon)
[09:26:54] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11838092 (10ops-monitoring-bot) Draining ganeti5006.eqsin.wmnet of running VMs
[09:27:40] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-serve: remove excludeIPRanges from cni config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722)
[09:28:01] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838096 (10MoritzMuehlenhoff)
[09:29:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard2003.codfw.wmnet
[09:29:46] <wikibugs>	 (03CR) 10MVernon: [C:03+2] apus: move eqiad controller moss-be1001 -> apus-be1005 [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon)
[09:29:59] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838115 (10ops-monitoring-bot) VM puppetboard2003.codfw.wmnet rebooted by jmm@cumin2002 with reason: None
[09:32:57] <wikibugs>	 (03PS1) 10Klausman: ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356
[09:33:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard2003.codfw.wmnet
[09:33:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419961)', diff saved to https://phabricator.wikimedia.org/P91168 and previous config saved to /var/cache/conftool/dbconfig/20260420-093337-fceratto.json
[09:33:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:34:57] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:35:01] <wikibugs>	 (03CR) 10Arthur taylor: Enable and configure WikiProjects prototype on WikiData beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven)
[09:35:05] <logmsgbot>	 !log kamila@deploy1003 Started scap sync-world: ICU 72 upgrade
[09:35:06] <marostegui>	 !ack
[09:35:07] <bjensen>	 !ack
[09:35:07] <sirenbot>	 7855 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[09:35:08] <sirenbot>	 All incidents are already acked.
[09:35:08] <wikibugs>	 (03PS1) 10Phuedx: PHP SDK: Split measurement of unknown experiments [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112)
[09:35:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx)
[09:35:39] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91169 and previous config saved to /var/cache/conftool/dbconfig/20260420-093538-fceratto.json
[09:36:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard1003.eqiad.wmnet
[09:36:35] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:36:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:36:56] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838189 (10ops-monitoring-bot) VM puppetboard1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None
[09:37:20] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+1] ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356 (owner: 10Klausman)
[09:38:01] <bjensen>	 marostegui: do you happen to have any idea how long we ought to wait to see if this resolves itself?
[09:38:01] <Raine>	 starting ICU 72 upgrade, a bit early so I have enough time to test
[09:38:08] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:20] <wikibugs>	 (03CR) 10Klausman: [C:03+2] ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356 (owner: 10Klausman)
[09:38:21] <marostegui>	 bjensen: I don't think we have to assume things would resolve on their own, check -sre
[09:38:27] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.5 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272777 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[09:38:41] <bjensen>	 ah, i was reading from https://wikitech.wikimedia.org/wiki/Thanos#Service_thanos-query:443_has_failed_probes
[09:39:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:19] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356 (owner: 10Klausman)
[09:40:20] <marostegui>	 bjensen: I guess we got lucky this time, but in general I'd investigate
[09:40:25] <marostegui>	 which is what I was doing :)
[09:40:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard1003.eqiad.wmnet
[09:40:50] <bjensen>	 hm, i would prefer then that the docs not say that the situation often self-resolves
[09:41:01] <wikibugs>	 (03Merged) 10jenkins-bot: mcrouter: update to 1.3.5 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272777 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[09:41:13] <icinga-wm>	 RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[09:41:28] <bjensen>	 imo the alerting threshold should be adjusted if there are times where this can fire and we might not need to look at it
[09:41:40] <marostegui>	 bjensen: absolutely yeah
[09:42:03] <marostegui>	 bjensen: maybe we need a task to re-evaluate thresholds there
[09:42:07] <marostegui>	 tappof: would that make sense ^?
[09:42:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[09:42:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[09:42:44] <wikibugs>	 (03PS3) 10Effie Mouzeli: mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360)
[09:43:00] <Emperor>	 !log ceph orch host drain moss-be1001 T418901
[09:43:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:03] <stashbot>	 T418901: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901
[09:43:46] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P91170 and previous config saved to /var/cache/conftool/dbconfig/20260420-094345-fceratto.json
[09:44:24] <wikibugs>	 (03PS2) 10Effie Mouzeli: (WIP) update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739
[09:45:32] <wikibugs>	 (03PS1) 10Klausman: ml-services: Addendum to "Allow LLM workloads to work on ml-serve1013" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275359
[09:45:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419635)', diff saved to https://phabricator.wikimedia.org/P91171 and previous config saved to /var/cache/conftool/dbconfig/20260420-094546-fceratto.json
[09:45:51] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:45:53] <wikibugs>	 (03CR) 10Klausman: [V:03+2 C:03+2] ml-services: Addendum to "Allow LLM workloads to work on ml-serve1013" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275359 (owner: 10Klausman)
[09:46:04] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[09:46:12] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T419635)', diff saved to https://phabricator.wikimedia.org/P91172 and previous config saved to /var/cache/conftool/dbconfig/20260420-094612-fceratto.json
[09:46:35] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:46:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch Cloud VPS to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273441 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff)
[09:47:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:51] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Addendum to "Allow LLM workloads to work on ml-serve1013" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275359 (owner: 10Klausman)
[09:48:20] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:48:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T419635)', diff saved to https://phabricator.wikimedia.org/P91174 and previous config saved to /var/cache/conftool/dbconfig/20260420-094823-fceratto.json
[09:49:24] <wikibugs>	 (03CR) 10FNegri: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[09:50:50] <wikibugs>	 (03PS4) 10Effie Mouzeli: mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360)
[09:51:18] <wikibugs>	 (03CR) 10Marostegui: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[09:51:27] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[09:51:39] <wikibugs>	 (03PS1) 10Marostegui: Revert "site.pp: Move clouddb1024 to analytics" [puppet] - 10https://gerrit.wikimedia.org/r/1275364
[09:52:03] <logmsgbot>	 !log kamila@deploy1003 kamila: ICU 72 upgrade synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:53:35] <wikibugs>	 (03PS2) 10Marostegui: eqiad.yaml: Add clouddb1025 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557)
[09:53:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "site.pp: Move clouddb1024 to analytics" [puppet] - 10https://gerrit.wikimedia.org/r/1275364 (owner: 10Marostegui)
[09:53:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275345 (owner: 10Elukey)
[09:53:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P91175 and previous config saved to /var/cache/conftool/dbconfig/20260420-095354-fceratto.json
[09:55:17] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838253 (10MoritzMuehlenhoff)
[09:55:24] <wikibugs>	 (03PS2) 10Marostegui: cloudb1025: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557)
[09:55:56] <wikibugs>	 (03PS3) 10Marostegui: cloudb1025: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557)
[09:56:27] <wikibugs>	 (03CR) 10Marostegui: "Reverted the site.pp patch and updated https://gerrit.wikimedia.org/r/1273785" [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[09:58:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast1003.wikimedia.org
[09:58:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91176 and previous config saved to /var/cache/conftool/dbconfig/20260420-095831-fceratto.json
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1000)
[10:02:04] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1165: after reimage to trixie
[10:02:34] <Emperor>	 !log ceph orch host drain moss-be1002 T418901
[10:02:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:38] <stashbot>	 T418901: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901
[10:04:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419961)', diff saved to https://phabricator.wikimedia.org/P91178 and previous config saved to /var/cache/conftool/dbconfig/20260420-100402-fceratto.json
[10:04:15] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance
[10:04:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T419961)', diff saved to https://phabricator.wikimedia.org/P91179 and previous config saved to /var/cache/conftool/dbconfig/20260420-100423-fceratto.json
[10:06:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) (owner: 10D3r1ck01)
[10:07:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:07:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:08:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91180 and previous config saved to /var/cache/conftool/dbconfig/20260420-100839-fceratto.json
[10:08:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:10:24] <wikibugs>	 (03PS3) 10Effie Mouzeli: mw-mcrouter: update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 (https://phabricator.wikimedia.org/T421360)
[10:13:28] <logmsgbot>	 jmm@cumin2002 decommission (PID 272402) is awaiting input
[10:14:23] <logmsgbot>	 !log kamila@deploy1003 kamila: Continuing with sync
[10:14:42] <wikibugs>	 (03PS1) 10MVernon: hiera: remove two old apus backends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1275366 (https://phabricator.wikimedia.org/T418901)
[10:15:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:17:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove bast1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275367 (https://phabricator.wikimedia.org/T423673)
[10:18:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T419635)', diff saved to https://phabricator.wikimedia.org/P91181 and previous config saved to /var/cache/conftool/dbconfig/20260420-101847-fceratto.json
[10:18:54] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:18:59] <logmsgbot>	 jmm@cumin2002 decommission (PID 272402) is awaiting input
[10:19:05] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[10:19:13] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T419635)', diff saved to https://phabricator.wikimedia.org/P91182 and previous config saved to /var/cache/conftool/dbconfig/20260420-101913-fceratto.json
[10:21:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419635)', diff saved to https://phabricator.wikimedia.org/P91183 and previous config saved to /var/cache/conftool/dbconfig/20260420-102125-fceratto.json
[10:24:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove bast1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275367 (https://phabricator.wikimedia.org/T423673) (owner: 10Muehlenhoff)
[10:24:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:24:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:24:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast1003.wikimedia.org
[10:26:27] <logmsgbot>	 !log kamila@deploy1003 Finished scap sync-world: ICU 72 upgrade (duration: 51m 35s)
[10:27:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11838300 (10VRiley-WMF) a:03VRiley-WMF
[10:31:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91184 and previous config saved to /var/cache/conftool/dbconfig/20260420-103133-fceratto.json
[10:31:59] <tappof>	 marostegui: bjensen: There could be another page, sorry. While doing a test, I refreshed a dashboard, and I think it's the "bad" one.
[10:32:14] <bjensen>	 gotcha, thanks
[10:32:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11838303 (10VRiley-WMF)
[10:32:30] <marostegui>	 tappof: got it thanks
[10:32:52] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[10:32:56] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[10:33:08] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:33:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11838306 (10VRiley-WMF) @BTullis Hey Ben, we can replace this cable in order to clear up this error. Can y...
[10:33:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ganeti5006 from ganeti01 eqsin cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275369 (https://phabricator.wikimedia.org/T421863)
[10:36:37] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission bast1003 - https://phabricator.wikimedia.org/T423673#11838327 (10MoritzMuehlenhoff)
[10:38:08] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:41:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91185 and previous config saved to /var/cache/conftool/dbconfig/20260420-104141-fceratto.json
[10:45:43] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[10:47:21] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Confirmed unused in wmf.24:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01)
[10:47:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[10:47:56] <wikibugs>	 (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275372
[10:49:43] <wikibugs>	 (03Merged) 10jenkins-bot: mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[10:50:21] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[10:50:21] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[10:51:08] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Remove ganeti5006 from ganeti01 eqsin cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275369 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[10:51:49] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419635)', diff saved to https://phabricator.wikimedia.org/P91186 and previous config saved to /var/cache/conftool/dbconfig/20260420-105148-fceratto.json
[10:51:54] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:52:06] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[10:52:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P91187 and previous config saved to /var/cache/conftool/dbconfig/20260420-105213-fceratto.json
[10:55:43] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[10:56:15] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11838392 (10VRiley-WMF) Understood, thank you for the heads up! @Marostegui
[11:01:42] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11838401 (10VRiley-WMF)
[11:06:54] <wikibugs>	 (03PS2) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804)
[11:08:49] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360)
[11:10:23] <wikibugs>	 (03PS2) 10Effie Mouzeli: mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360)
[11:11:29] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[11:16:09] <wikibugs>	 (03PS1) 10JMeybohm: Add wikikube-worker refreshes [puppet] - 10https://gerrit.wikimedia.org/r/1275377 (https://phabricator.wikimedia.org/T418257)
[11:16:15] <wikibugs>	 (03CR) 10Federico Ceratto: "I tested parsercache and worked, not tested depool yet but it's pretty much the same." [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto)
[11:16:49] <wikibugs>	 (03PS3) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804)
[11:17:25] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1025.eqiad.wmnet,service=x4
[11:17:55] <wikibugs>	 (03CR) 10FNegri: [C:03+1] cloudb1025: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[11:19:36] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[11:21:13] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance sretest2010:9100) - https://phabricator.wikimedia.org/T423856 (10LSobanski) 03NEW
[11:24:45] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "I depooled x4 from clouddb1025, this can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[11:27:18] <wikibugs>	 (03CR) 10FNegri: "@marostegui@wikimedia.org this was on hold because of the mariadb issue, but now we can merge it." [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri)
[11:30:57] <wikibugs>	 (03PS4) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804)
[11:31:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] Add wikikube-worker refreshes [puppet] - 10https://gerrit.wikimedia.org/r/1275377 (https://phabricator.wikimedia.org/T418257) (owner: 10JMeybohm)
[11:33:24] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Add wikikube-worker refreshes [puppet] - 10https://gerrit.wikimedia.org/r/1275377 (https://phabricator.wikimedia.org/T418257) (owner: 10JMeybohm)
[11:35:10] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] sre.mysql.pool: Handle private tasks exception [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto)
[11:36:43] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[11:37:40] <zabe>	 jouncebot: nowandnext
[11:37:40] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 22 minute(s)
[11:37:40] <jouncebot>	 In 1 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1300)
[11:41:21] <wikibugs>	 (03PS5) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804)
[11:49:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet
[11:50:24] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[11:52:05] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275366 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon)
[11:52:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P91188 and previous config saved to /var/cache/conftool/dbconfig/20260420-115231-fceratto.json
[11:52:35] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:53:46] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.pool: Handle private tasks exception [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto)
[11:55:05] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] eqiad.yaml: Add clouddb1025 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[11:57:51] <wikibugs>	 (03CR) 10Muehlenhoff: "Good thing you prodded me for that, there was actually more things to fix... PCC now added and looking fine." [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[11:59:50] <wikibugs>	 (03CR) 10MVernon: [C:03+2] hiera: remove two old apus backends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1275366 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon)
[12:02:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P91189 and previous config saved to /var/cache/conftool/dbconfig/20260420-120239-fceratto.json
[12:05:06] <wikibugs>	 (03PS1) 10MVernon: preseed: increase size of / for thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/1275384 (https://phabricator.wikimedia.org/T423690)
[12:10:56] <moritzm>	 !log installing edk2 security updates
[12:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts moss-be[1001-1002].eqiad.wmnet
[12:12:49] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P91190 and previous config saved to /var/cache/conftool/dbconfig/20260420-121247-fceratto.json
[12:14:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission bast1003 - https://phabricator.wikimedia.org/T423673#11838644 (10VRiley-WMF) a:03VRiley-WMF
[12:15:36] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1328-1334,1360-1374].eqiad.wmnet
[12:15:41] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1328-1334,1360-1374].eqiad.wmnet
[12:16:31] <moritzm>	 !log remove ganeti5006 from eqsin01 Ganeti cluster (running classic Ganeti) T421863
[12:16:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:35] <stashbot>	 T421863: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863
[12:16:44] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1328-1334,1360-1374].eqiad.wmnet - https://phabricator.wikimedia.org/T423863 (10JMeybohm) 03NEW
[12:17:15] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1328-1334,1360-1374].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11838659 (10JMeybohm)
[12:17:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti5006 from ganeti01 eqsin cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275369 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[12:17:32] <zabe>	 !log Deployed patch for T423821
[12:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:24] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[12:19:24] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[12:20:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission bast1003 - https://phabricator.wikimedia.org/T423673#11838678 (10VRiley-WMF) 05Open→03Resolved
[12:21:32] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11838680 (10JMeybohm)
[12:22:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P91191 and previous config saved to /var/cache/conftool/dbconfig/20260420-122256-fceratto.json
[12:23:03] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:23:08] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:23:14] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[12:23:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T419635)', diff saved to https://phabricator.wikimedia.org/P91192 and previous config saved to /var/cache/conftool/dbconfig/20260420-122321-fceratto.json
[12:25:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419635)', diff saved to https://phabricator.wikimedia.org/P91193 and previous config saved to /var/cache/conftool/dbconfig/20260420-122534-fceratto.json
[12:25:53] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275393
[12:26:38] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[12:28:13] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096,1098-1112,1166-1168].eqiad.wmnet
[12:31:07] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-be[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[12:32:55] <wikibugs>	 (03CR) 10Btullis: [C:03+2] maintain-views: Hide blocks with bl_deleted set to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1273781 (https://phabricator.wikimedia.org/T414188) (owner: 10Dreamy Jazz)
[12:34:12] <logmsgbot>	 mvernon@cumin2002 decommission (PID 356509) is awaiting input
[12:35:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91194 and previous config saved to /var/cache/conftool/dbconfig/20260420-123542-fceratto.json
[12:35:44] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-be[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[12:35:44] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:35:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts moss-be[1001-1002].eqiad.wmnet
[12:35:53] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11838718 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `moss-be[1001-1002].eqiad.wmnet` - moss-be1...
[12:45:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91195 and previous config saved to /var/cache/conftool/dbconfig/20260420-124550-fceratto.json
[12:53:08] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:53:28] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11838769 (10jijiki) Thank you!
[12:54:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:55:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419635)', diff saved to https://phabricator.wikimedia.org/P91196 and previous config saved to /var/cache/conftool/dbconfig/20260420-125559-fceratto.json
[12:56:06] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:56:17] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[12:56:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1236 (T419635)', diff saved to https://phabricator.wikimedia.org/P91197 and previous config saved to /var/cache/conftool/dbconfig/20260420-125624-fceratto.json
[12:58:37] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419635)', diff saved to https://phabricator.wikimedia.org/P91198 and previous config saved to /var/cache/conftool/dbconfig/20260420-125837-fceratto.json
[12:58:48] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096,1098-1112,1166-1168].eqiad.wmnet
[12:58:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11838793 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node...
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1300). Please do the needful.
[13:00:05] <jouncebot>	 xSavitar, aude, and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <aude>	 hi
[13:02:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:42] <xSavitar>	 o/
[13:02:55] <wikibugs>	 (03CR) 10Elukey: [C:03+1] mw-mcrouter: update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[13:04:54] <xSavitar>	 Getting an error when trying to get a scap OTP
[13:04:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:04:58] <xSavitar>	 ssh: Could not resolve hostname bast1003.wikimedia.org: nodename nor servname provided, or not known
[13:04:59] <xSavitar>	 Connection closed by UNKNOWN port 65535
[13:05:08] <xSavitar>	 Did anything change recently?
[13:05:24] <taavi>	 xSavitar: https://lists.wikimedia.org/hyperkitty/list/ops@lists.wikimedia.org/thread/DQ7KFORXBZQX55NR23QHZDNFOSXETLQV/
[13:05:42] <xSavitar>	 taavi, thanks, having a quick read now.
[13:05:49] <wikibugs>	 (03CR) 10Elukey: "Thinking out loud - would it be better to add one option at the time, incrementally? For example, we could start with the 10 timeouts unti" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli)
[13:06:05] <xSavitar>	 taavi, mailing list is private and I'm not subscribed :(
[13:06:10] <wikibugs>	 (03CR) 10Elukey: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275372 (owner: 10Muehlenhoff)
[13:06:15] <Lucas_WMDE>	 o/
[13:06:34] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::cumin: update insetup_role_report.py [puppet] - 10https://gerrit.wikimedia.org/r/1275345 (owner: 10Elukey)
[13:06:43] <xSavitar>	 Lucas_WMDE o/
[13:06:44] <Lucas_WMDE>	 xSavitar: you need bast1004
[13:06:56] <Lucas_WMDE>	 bast1003 was retired (there’s probably a phab task but it’s not linked in the email AFAITC)
[13:06:58] <xSavitar>	 I wanted to self-serve, but I would need some help so that I can set up bast1004 later
[13:07:03] <xSavitar>	 Thanks!
[13:07:10] <Lucas_WMDE>	 okay, I can deploy
[13:07:18] <xSavitar>	 Thank you very much!
[13:07:30] <xSavitar>	 It's a no-op config patch. The config setting should be unused now
[13:07:31] <Lucas_WMDE>	 might as well do the two config changes together, I think
[13:07:37] <xSavitar>	 Ack!
[13:07:46] <Lucas_WMDE>	 (FYI aude ^)
[13:08:04] <aude>	 i'm ready
[13:08:25] <Lucas_WMDE>	 hm, scap complains about dependencies *looks*
[13:08:35] <Lucas_WMDE>	 “but the dependency is not present in recent train branch: wmf/1.46.0-wmf.23”
[13:08:43] <Lucas_WMDE>	 “This branch is a likely rollback target” not sure I disagree, it’s Monday
[13:08:45] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91199 and previous config saved to /var/cache/conftool/dbconfig/20260420-130845-fceratto.json
[13:08:47] <Lucas_WMDE>	 not sure I *agree
[13:09:36] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "I am ok to proceed, but was this tested in staging with a kill/start of a pod etc..? Just to be sure that we are not getting into some wei" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski)
[13:09:54] <Lucas_WMDE>	 yeah, nah, let’s deploy
[13:09:58] <phuedx>	 o/
[13:10:02] <phuedx>	 Sorry I'm late
[13:10:22] <Lucas_WMDE>	 no problem, we’re starting with the config changes now
[13:10:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01)
[13:10:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273842 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude)
[13:11:27] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused JWT for bot password temporary config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01)
[13:11:44] <xSavitar>	 We need to update spider pig website
[13:11:54] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ReadingLists beta feature for all Wikipedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273842 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude)
[13:12:01] <xSavitar>	 It still references bast1003 to get an OTP
[13:12:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1247960|Remove unused JWT for bot password temporary config (T422367 T415007)]], [[gerrit:1273842|Enable ReadingLists beta feature for all Wikipedia wikis (T420881)]]
[13:12:34] <stashbot>	 T422367: Remove temporary JWT session configuration setting for BotPasswords - https://phabricator.wikimedia.org/T422367
[13:12:35] <stashbot>	 T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[13:12:35] <stashbot>	 T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881
[13:12:38] <Lucas_WMDE>	 huh, ok
[13:13:18] <taavi>	 on which page exactly?
[13:13:36] <Lucas_WMDE>	 taavi: I can see it in otpHost in https://spiderpig.wikimedia.org/api/whoami
[13:13:46] <Lucas_WMDE>	 (I think that’s where web/src/components/LoginPage.vue then gets it from)
[13:13:53] <Lucas_WMDE>	 wait
[13:13:55] <Lucas_WMDE>	 deploy1003, not bast1003
[13:13:58] <xSavitar>	 taavi, after one logs in
[13:14:09] <xSavitar>	 https://spiderpig.wikimedia.org/ (after logging in)
[13:14:24] <xSavitar>	 I'm filing a task about it now
[13:14:28] <Lucas_WMDE>	 xSavitar: are you sure it’s telling you which *bastion* to use? (I got confused between bast1003 and deploy1003 just now)
[13:15:17] <icinga-wm>	 PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100%
[13:15:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude, d3r1ck01: Backport for [[gerrit:1247960|Remove unused JWT for bot password temporary config (T422367 T415007)]], [[gerrit:1273842|Enable ReadingLists beta feature for all Wikipedia wikis (T420881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:15:46] <Lucas_WMDE>	 aude: please test :)
[13:15:47] <aude>	 checking
[13:15:49] <Lucas_WMDE>	 xSavitar: anything to test?
[13:15:51] <xSavitar>	 Lucas_WMDE, ack! you're right.
[13:15:55] <xSavitar>	 Lucas_WMDE, nothing to test.
[13:16:00] <Lucas_WMDE>	 ok
[13:16:41] <aude>	 looks good
[13:16:52] <xSavitar>	 taavi, I don't think I need to file it after all, this looks like my problem to resolve. Thanks! I believe deploy1003 should work.
[13:16:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude, d3r1ck01: Continuing with sync
[13:17:00] <Lucas_WMDE>	 ok!
[13:18:43] <xSavitar>	 Lucas_WMDE, taavi, I was able to get the OTP (after adjusting SSH config). Works now, thanks to you both!
[13:18:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91200 and previous config saved to /var/cache/conftool/dbconfig/20260420-131853-fceratto.json
[13:19:35] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx)
[13:19:39] <Lucas_WMDE>	 xSavitar: okay, great!
[13:19:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:20:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1247960|Remove unused JWT for bot password temporary config (T422367 T415007)]], [[gerrit:1273842|Enable ReadingLists beta feature for all Wikipedia wikis (T420881)]] (duration: 08m 21s)
[13:20:49] <wikibugs>	 (03CR) 10Marostegui: "Works for me!" [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri)
[13:20:53] <stashbot>	 T422367: Remove temporary JWT session configuration setting for BotPasswords - https://phabricator.wikimedia.org/T422367
[13:20:53] <stashbot>	 T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[13:20:54] <stashbot>	 T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881
[13:20:56] <aude>	 thanks Lucas_WMDE !
[13:20:58] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch on k8s: Add semantic-search and ipoid to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[13:21:02] <Lucas_WMDE>	 phuedx: want to deploy your backport yourself?
[13:21:06] <wikibugs>	 (03Merged) 10jenkins-bot: PHP SDK: Split measurement of unknown experiments [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx)
[13:21:06] <Lucas_WMDE>	 np aude :)
[13:21:21] <phuedx>	 Can do
[13:21:27] <Lucas_WMDE>	 alright, go ahead :)
[13:21:41] <aude>	 I suppose there is no train this week
[13:21:52] <xSavitar>	 Lucas_WMDE, thanks for helping aude and me deploy. I appreciate it. 🙏🏽
[13:22:02] <Lucas_WMDE>	 huh, what’s up with the train?
[13:22:05] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1275357|PHP SDK: Split measurement of unknown experiments (T422112)]]
[13:22:09] <stashbot>	 T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112
[13:22:12] <Lucas_WMDE>	 oh, earth day https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar
[13:22:52] <xSavitar>	 Scrolling through https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0200, it looks like yes, no train this week.
[13:22:57] <aude>	 WMF staff have a holiday on Wednesday (and I do not see the train on the calendar)
[13:23:41] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1275357|PHP SDK: Split measurement of unknown experiments (T422112)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:23:45] <Lucas_WMDE>	 thx
[13:23:45] <icinga-wm>	 RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[13:25:00] <jinxer-wm>	 FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1012:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished  - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[13:26:05] <phuedx>	 Quick spot check on enwiki and dewiki LGTM
[13:26:08] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868 (10MatthewVernon) 03NEW
[13:26:08] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Continuing with sync
[13:26:40] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869 (10FCeratto-WMF) 03NEW
[13:29:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419635)', diff saved to https://phabricator.wikimedia.org/P91202 and previous config saved to /var/cache/conftool/dbconfig/20260420-132901-fceratto.json
[13:29:06] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:29:19] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1253.eqiad.wmnet with reason: Maintenance
[13:29:27] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T419635)', diff saved to https://phabricator.wikimedia.org/P91203 and previous config saved to /var/cache/conftool/dbconfig/20260420-132926-fceratto.json
[13:29:50] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:29:56] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275357|PHP SDK: Split measurement of unknown experiments (T422112)]] (duration: 07m 51s)
[13:30:00] <jinxer-wm>	 RESOLVED: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1012:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished  - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[13:30:00] <stashbot>	 T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112
[13:30:05] <logmsgbot>	 !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1014.eqiad.wmnet with reason: Decommissioning — T412830
[13:30:09] * phuedx watches logs
[13:30:11] <stashbot>	 T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830
[13:31:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T419635)', diff saved to https://phabricator.wikimedia.org/P91204 and previous config saved to /var/cache/conftool/dbconfig/20260420-133139-fceratto.json
[13:32:34] <urandom>	 decommissioning Cassandra, aqs1014 [a,b] — T412830
[13:32:37] <urandom>	 !log decommissioning Cassandra, aqs1014 [a,b] — T412830
[13:32:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:27] <wikibugs>	 (03PS2) 10Bking: opensearch on k8s: Activate semantic-search and ipoid in services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1272909 (https://phabricator.wikimedia.org/T421293)
[13:33:35] <phuedx>	 Lucas_WMDE: The logs look good. I think that's the end of the window?
[13:34:06] * Lucas_WMDE reloads the calendar
[13:34:08] <Lucas_WMDE>	 looks like it yeah
[13:34:09] <Lucas_WMDE>	 thanks!
[13:34:16] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:33] <phuedx>	 Quick lunch!
[13:35:46] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[13:37:38] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:38:35] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868#11839006 (10Jclark-ctr) a:03Jclark-ctr
[13:40:19] <wikibugs>	 (03PS1) 10Daniel Kinzler: api rate limits: use global apihighlimits-requestor group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796)
[13:41:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P91205 and previous config saved to /var/cache/conftool/dbconfig/20260420-134148-fceratto.json
[13:41:51] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:41:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91206 and previous config saved to /var/cache/conftool/dbconfig/20260420-134158-fceratto.json
[13:43:06] <wikibugs>	 (03CR) 10Elukey: ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[13:43:13] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868#11839045 (10Jclark-ctr)
[13:43:18] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868#11839046 (10Jclark-ctr) 05Open→03Resolved
[13:43:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11839049 (10Jclark-ctr) a:05Jclark-ctr→03brouberol
[13:44:16] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbstore1010 to eqiad - jclark@cumin1003"
[13:44:22] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbstore1010 to eqiad - jclark@cumin1003"
[13:44:22] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:44:35] <wikibugs>	 10ops-eqiad, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872 (10phaultfinder) 03NEW
[13:45:02] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host dbstore1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:46:24] <wikibugs>	 10SRE-swift-storage, 06DBA, 10MediaWiki-File-management, 07Regression: Stuck-hidden file / Deleted file revisions displaying improperly - https://phabricator.wikimedia.org/T423065#11839078 (10Bugreporter) >>! In T423065#11837057, @Zabe wrote: > Should be working again. Following up in T423821.  See als...
[13:47:58] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:50:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91207 and previous config saved to /var/cache/conftool/dbconfig/20260420-135025-fceratto.json
[13:50:50] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:51:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P91208 and previous config saved to /var/cache/conftool/dbconfig/20260420-135155-fceratto.json
[13:52:07] <wikibugs>	 (03PS3) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850)
[13:52:32] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839106 (10FCeratto-WMF)
[13:52:36] <zabe>	 jouncebot: nowandnext
[13:52:36] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1300)
[13:52:36] <jouncebot>	 In 0 hour(s) and 37 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1430)
[13:52:47] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[13:52:56] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T410589)', diff saved to https://phabricator.wikimedia.org/P91209 and previous config saved to /var/cache/conftool/dbconfig/20260420-135255-ladsgroup.json
[13:53:00] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[13:58:50] <wikibugs>	 (03PS1) 10Marostegui: ms2: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275414
[13:59:08] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Reimage to Trixie
[13:59:21] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11839147 (10VRiley-WMF)
[13:59:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] ms2: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275414 (owner: 10Marostegui)
[14:00:00] <wikibugs>	 (03PS1) 10Elukey: profile::pki::root_ca: create a new discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993)
[14:00:02] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:00:02] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1151.eqiad.wmnet with reason: Reimage to Trixie
[14:00:07] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1151: Reimage to Trixie
[14:00:07] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[14:00:14] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[14:00:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[14:00:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1151: Reimage to Trixie
[14:00:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P91211 and previous config saved to /var/cache/conftool/dbconfig/20260420-140033-fceratto.json
[14:00:43] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1151.eqiad.wmnet with OS trixie
[14:01:54] <wikibugs>	 (03PS3) 10FNegri: conftool-data: move s3, x3 to new hosts (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557)
[14:02:01] <urandom>	 !log upgrade envoyproxy, restbase — T419637 & T410975
[14:02:04] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T419635)', diff saved to https://phabricator.wikimedia.org/P91212 and previous config saved to /var/cache/conftool/dbconfig/20260420-140203-fceratto.json
[14:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:10] <stashbot>	 T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637
[14:02:12] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[14:02:20] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:02:21] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[14:02:28] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::striker: Remove separate monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/1270282 (owner: 10Majavah)
[14:06:57] <wikibugs>	 (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275384 (https://phabricator.wikimedia.org/T423690) (owner: 10MVernon)
[14:07:38] <wikibugs>	 (03CR) 10MVernon: [C:03+2] preseed: increase size of / for thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/1275384 (https://phabricator.wikimedia.org/T423690) (owner: 10MVernon)
[14:09:18] <wikibugs>	 06SRE, 10envoy, 06ServiceOps new, 10ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11839189 (10Eevans)
[14:10:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P91213 and previous config saved to /var/cache/conftool/dbconfig/20260420-141042-fceratto.json
[14:14:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye
[14:14:22] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1151.eqiad.wmnet with reason: host reimage
[14:14:38] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability, 13Patch-For-Review: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839198 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2006.codfw.wmnet...
[14:15:09] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host dbstore1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:18:45] <wikibugs>	 (03PS1) 10Marostegui: Revert "ms2: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275423
[14:19:17] <wikibugs>	 (03CR) 10FNegri: [C:03+2] "Rebased, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri)
[14:19:19] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1151.eqiad.wmnet with reason: host reimage
[14:19:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1012.eqiad.wmnet
[14:20:02] <wikibugs>	 06SRE, 10observability: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T422816#11839245 (10ayounsi)
[14:20:05] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance sretest2010:9100) - https://phabricator.wikimedia.org/T423856#11839247 (10jhathaway) p:05Triage→03Medium a:03jhathaway
[14:20:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91214 and previous config saved to /var/cache/conftool/dbconfig/20260420-142050-fceratto.json
[14:21:02] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#11839250 (10ayounsi) p:05Triage→03Medium
[14:21:13] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[14:21:21] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91215 and previous config saved to /var/cache/conftool/dbconfig/20260420-142120-fceratto.json
[14:21:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudelastic1012.eqiad.wmnet
[14:21:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11839254 (10LSobanski) p:05Triage→03High
[14:22:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:22:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:22:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:23:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11839263 (10Scott_French)
[14:26:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11839267 (10Scott_French) @AnnieKim_WMDE - Please see https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process f...
[14:26:52] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:26:55] <logmsgbot>	 jclark@cumin1003 provision (PID 3284747) is awaiting input
[14:29:41] <wikibugs>	 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11839301 (10BTullis) >>! In T348935#11834420, @BTullis wrote: > It's worth noting that...
[14:30:04] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1430)
[14:30:33] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839308 (10Jhancock.wm) @FCeratto-WMF got it to boot.  powered off, drained the flea power, and reseated the cables to the backplane.  This error could have been caused by a loose cable....
[14:30:42] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839309 (10Jhancock.wm) a:03Jhancock.wm
[14:30:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11839313 (10Jclark-ctr) a:03Jclark-ctr `    ps1-c6-eqiad.mgmt.eqiad.wmnet #1: Phase, AA:L2-L3, Active Power;  Value: 1662 (power) high: 1650 `
[14:32:41] <wikibugs>	 10ops-codfw, 06SRE, 10Data-Persistence-Misc, 06DC-Ops: db2201 broken DIMM - https://phabricator.wikimedia.org/T423184#11839328 (10Jhancock.wm) 05Open→03Declined
[14:33:09] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839335 (10FCeratto-WMF) Thanks!
[14:33:27] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839337 (10Jhancock.wm)
[14:34:42] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839359 (10Marostegui) Would this need an IP change? It should be fairly easy to get this host depooled, when would you like to get it done?
[14:35:59] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbstore1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:36:53] <wikibugs>	 (03CR) 10Scott French: [C:03+1] backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo)
[14:36:59] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
[14:37:15] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dbstore1010.eqiad.wmnet with OS bookworm
[14:37:16] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbstore1010.eqiad.wmnet with OS bookworm
[14:38:04] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839381 (10Jhancock.wm) 05Open→03Resolved
[14:38:52] <wikibugs>	 (03PS1) 10Aude: Limit donate button to Wikipedia wikis (except Finnish) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876)
[14:40:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude)
[14:40:39] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dbstore1010.eqiad.wmnet with OS bookworm
[14:40:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839409 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host dbstore1010.eqiad.wmnet with OS bookworm
[14:41:07] <wikibugs>	 (03CR) 10Anne Tomasevich: [C:03+1] Limit donate button to Wikipedia wikis (except Finnish) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude)
[14:41:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839411 (10Jclark-ctr) a:03Jclark-ctr
[14:41:27] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839410 (10Jhancock.wm) It does not need an IP change. only needs a few values in netbox updated and running dns cookbook to catch changes. It's going to stay in the same rack. I can do this any day of the week ar...
[14:42:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839413 (10Jclark-ctr)
[14:42:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: sretest2001 has broken psu - https://phabricator.wikimedia.org/T423179#11839414 (10Jhancock.wm) 05Open→03Declined
[14:42:08] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1151.eqiad.wmnet with OS trixie
[14:43:07] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1151: after reimage to trixie
[14:43:07] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db1151: after reimage to trixie
[14:44:26] <wikibugs>	 (03PS1) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812)
[14:45:09] <wikibugs>	 (03PS1) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (https://phabricator.wikimedia.org/T418261)
[14:45:23] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
[14:45:53] <logmsgbot>	 !log cwhite@deploy1003 Started deploy [performance/arc-lamp@bd7b2ab]: T413127
[14:45:57] <stashbot>	 T413127: Directory Listing and Download from Object Storage - https://phabricator.wikimedia.org/T413127
[14:46:02] <logmsgbot>	 !log cwhite@deploy1003 Finished deploy [performance/arc-lamp@bd7b2ab]: T413127 (duration: 00m 08s)
[14:47:49] <wikibugs>	 (03CR) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven)
[14:47:54] <wikibugs>	 (03PS2) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (https://phabricator.wikimedia.org/T418261)
[14:50:22] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Kindly requesting Kerberos password reset - https://phabricator.wikimedia.org/T423875#11839477 (10ssingh)
[14:51:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2188']
[14:51:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11839483 (10Jdforrester-WMF) 05Open→03In progress a:03Jdforrester-WMF
[14:52:02] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:57:20] <wikibugs>	 (03PS1) 10JMeybohm: Decom various wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T42386)
[14:58:11] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Kindly requesting Kerberos password reset - https://phabricator.wikimedia.org/T423875#11839516 (10ssingh) ` sukhe@krb1002:~$ sudo manage_principals.py reset-password mfischerwmf --email_address=mfischer@wikimedia.org Password reset successfully. Successfully sent emai...
[14:58:27] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Kindly requesting Kerberos password reset - https://phabricator.wikimedia.org/T423875#11839517 (10ssingh) 05Open→03Resolved
[14:58:29] <effie>	 jouncebot: now 
[14:58:29] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1430)
[14:58:35] <effie>	 jouncebot: next 
[14:58:35] <jouncebot>	 In 0 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1530)
[15:03:46] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[15:03:48] <logmsgbot>	 !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:04:38] <wikibugs>	 (03PS1) 10Bking: cloudelastic1012: Set LVS config for opensearch_2 [puppet] - 10https://gerrit.wikimedia.org/r/1275435 (https://phabricator.wikimedia.org/T422860)
[15:04:55] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275435 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[15:05:10] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1010.eqiad.wmnet with reason: host reimage
[15:05:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2006.codfw.wmnet with OS bullseye
[15:05:47] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2006.codfw.wmnet with OS bullseye completed...
[15:08:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "ms2: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275423 (owner: 10Marostegui)
[15:08:46] <wikibugs>	 (03CR) 10Bking: [C:03+2] cloudelastic1012: Set LVS config for opensearch_2 [puppet] - 10https://gerrit.wikimedia.org/r/1275435 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[15:09:05] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1010.eqiad.wmnet with reason: host reimage
[15:11:04] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1151: repool after maintenance
[15:11:04] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db1151: repool after maintenance
[15:11:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2188']
[15:11:40] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 68424256 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:11:45] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[15:11:45] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[15:12:49] <wikibugs>	 (03PS1) 10Marostegui: es2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275437 (https://phabricator.wikimedia.org/T423195)
[15:13:09] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2036: Moving to another rack
[15:13:24] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275437 (https://phabricator.wikimedia.org/T423195) (owner: 10Marostegui)
[15:13:27] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2036: Moving to another rack
[15:13:40] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2567752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:13:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11839636 (10Jhancock.wm) @Clement_Goubert I did a firmware and BIOS update. The error has cleared; it should be good to repool.
[15:14:22] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Moved to another rack
[15:14:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11839638 (10phaultfinder)
[15:16:36] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: move es2036 - https://phabricator.wikimedia.org/T423195#11839658 (10Marostegui) Host off, ready to be moved.
[15:17:32] <wikibugs>	 (03CR) 10Arlolra: [C:03+1] Increase Parsoid Read Views percentage for ruwiki to 55% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274387 (owner: 10C. Scott Ananian)
[15:20:11] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1166: Security update
[15:21:09] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275439 (https://phabricator.wikimedia.org/T414376)
[15:21:45] <jinxer-wm>	 RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[15:21:45] <jinxer-wm>	 RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[15:21:56] <Lucas_WMDE>	 I’ll deploy that ^ wikidata-query-gui bump soon if no one objects
[15:23:07] <wikibugs>	 (03PS1) 10Ottomata: html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216)
[15:23:33] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[15:23:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91217 and previous config saved to /var/cache/conftool/dbconfig/20260420-152341-fceratto.json
[15:23:45] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:25:23] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[15:25:46] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db1166: Security update
[15:25:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1012.eqiad.wmnet
[15:27:02] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839734 (10MatthewVernon) Thanos-be2006 now looks like: ` Filesystem      Size  Used Avail Use% Mounted on /dev/md0        110G  5.7G   99G   6% / /dev/sdy4...
[15:27:03] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:27:50] <wikibugs>	 (03Merged) 10jenkins-bot: html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[15:28:05] <wikibugs>	 (03PS1) 10JMeybohm: Decom various wikikube-workers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275442 (https://phabricator.wikimedia.org/T423863)
[15:28:05] <wikibugs>	 (03CR) 10JavierMonton: [C:03+1] html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[15:28:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[15:28:53] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839764 (10MatthewVernon) 05Open→03In progress
[15:30:04] <jouncebot>	 jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1530).
[15:30:09] <logmsgbot>	 jclark@cumin1003 reimage (PID 3304108) is awaiting input
[15:33:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:33:33] <wikibugs>	 (03PS2) 10Aude: Do not show donate button on affiliate wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876)
[15:34:49] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:34:53] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:35:48] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1166: Security update
[15:35:56] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::pki::root_ca: create a new discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[15:36:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:36:16] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:36:17] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1010.eqiad.wmnet with OS bookworm
[15:36:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host es2036
[15:36:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839832 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dbstore1010.eqiad.wmnet with OS bookworm completed: - dbstore1010 (**PASS**)   - Removed from P...
[15:36:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2036
[15:36:58] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudelastic1012.eqiad.wmnet
[15:37:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839833 (10Jclark-ctr)
[15:37:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839837 (10Jclark-ctr) 05Open→03Resolved
[15:37:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:37:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:38:09] <wikibugs>	 (03PS1) 10Bking: Cirrussearch: remove unused hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1275444 (https://phabricator.wikimedia.org/T388607)
[15:38:34] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275444 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking)
[15:39:37] <wikibugs>	 (03PS1) 10Ottomata: html-enrich - use mw-api-int for stream config too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275445 (https://phabricator.wikimedia.org/T421216)
[15:40:05] <wikibugs>	 (03CR) 10Ottomata: [V:03+2 C:03+2] html-enrich - use mw-api-int for stream config too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275445 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[15:41:15] <wikibugs>	 (03CR) 10Bking: [C:03+2] Cirrussearch: remove unused hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1275444 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking)
[15:41:26] <wikibugs>	 (03CR) 10Effie Mouzeli: Decom various wikikube-workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T42386) (owner: 10JMeybohm)
[15:41:32] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839873 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm host was moved, netbox and dns updated. mgmt and network ping. ready to go back in. @Marostegui thank you for helping us with this!
[15:41:33] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:41:37] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:42:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11839880 (10Jclark-ctr) @elukey   did we have a work around for the usernames?
[15:42:42] <wikibugs>	 (03PS1) 10Marostegui: Revert "es2036: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275447
[15:43:37] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] Decom various wikikube-workers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275442 (https://phabricator.wikimedia.org/T423863) (owner: 10JMeybohm)
[15:45:03] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "es2036: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275447 (owner: 10Marostegui)
[15:45:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11839891 (10elukey) @Jclark-ctr not yet, we haven't got a definitive reply from supermicro yet. I have some code patches lined up that should unblock...
[15:46:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:48:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5006.eqsin.wmnet with OS bookworm
[15:48:56] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11839919 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5006.eqsin.wmnet with OS bookworm
[15:49:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: trace all file uploads [puppet] - 10https://gerrit.wikimedia.org/r/1272869 (owner: 10CDanis)
[15:50:10] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2188.codfw.wmnet
[15:50:10] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker2188.codfw.wmnet
[15:50:45] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2036: Moving to another rack
[15:50:49] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es2036: Moving to another rack
[15:50:56] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2188.codfw.wmnet
[15:50:58] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2188.codfw.wmnet
[15:51:00] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2036: Moving to another rack
[15:51:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11839924 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool for host wikikube-worker2188.codfw.wmnet completed: - wikikube-worker2188...
[15:51:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11839937 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert tyvm :)
[15:51:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1273925 (owner: 10CDobbins)
[15:52:15] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] "(I do think it's confusing to have these two things in one change)" [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli)
[15:53:39] <wikibugs>	 (03PS2) 10JMeybohm: Decom various wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T423863)
[15:53:41] <wikibugs>	 (03PS2) 10JMeybohm: Decom various wikikube-workers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275442 (https://phabricator.wikimedia.org/T423863)
[15:55:02] <sukhe>	 !log sudo cumin -b31 "A:cp and not P{cp2041* or cp2042*}" "run-puppet-agent --enable 'merging CR 1272869'"
[15:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840036 (10LSobanski)
[15:57:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11840038 (10MoritzMuehlenhoff)
[15:57:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude)
[15:57:55] <moritzm>	 !log installing libvirt security updates
[15:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:08] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven)
[16:00:10] <wikibugs>	 (03CR) 10JHathaway: firewall::service: Add a new parameter public_access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[16:00:28] <wikibugs>	 (03PS5) 10Jasmine: role::aux_k8s::worker: add sophroid to lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)
[16:02:48] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11840075 (10Aklapper)
[16:03:19] <wikibugs>	 (03PS1) 10Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449
[16:05:14] <pmiazga>	 Quick question - there is a change I'd like to backport today to wmf.24, and for it I need to push 4 commits. What is the best way to do it? Do I cherry-pick and push 4 separate changes, or can I squash them into a single commit?
[16:06:28] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cloudelastic1012.eqiad.wmnet
[16:08:07] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Do not show donate button on affiliate wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude)
[16:08:28] <wikibugs>	 (03CR) 10Muehlenhoff: firewall::service: Add a new parameter public_access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[16:09:17] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:49] <wikibugs>	 (03PS1) 10Ottomata: html-enrich - update values with latest settings from T421216 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275453 (https://phabricator.wikimedia.org/T421216)
[16:11:16] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] html-enrich - update values with latest settings from T421216 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275453 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[16:13:12] <wikibugs>	 (03Merged) 10jenkins-bot: html-enrich - update values with latest settings from T421216 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275453 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[16:14:56] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449 (owner: 10Scott French)
[16:16:24] <wikibugs>	 (03PS1) 10Ottomata: html-enrich - set tolerable-failed-checkpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275458 (https://phabricator.wikimedia.org/T421216)
[16:16:57] <wikibugs>	 (03CR) 10Ottomata: [V:03+2 C:03+2] html-enrich - set tolerable-failed-checkpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275458 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[16:17:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840160 (10MoritzMuehlenhoff) >>! In T422596#11833600, @jcrespo wrote: > In any case, backupmon1001.eqiad.wmnet is a very very tiny instance (an apache with just 1 user- me). No pr...
[16:17:30] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:17:34] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:17:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage
[16:19:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM backupmon1001.eqiad.wmnet
[16:19:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840176 (10ops-monitoring-bot) VM backupmon1001.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None
[16:21:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1166: Security update
[16:22:55] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM.  The only thing that does spring to mind is the name, not sure if we might have some services on non-public vlans that SREs might wa" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[16:23:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91226 and previous config saved to /var/cache/conftool/dbconfig/20260420-162359-fceratto.json
[16:24:04] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:24:38] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1275460
[16:24:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840215 (10phaultfinder)
[16:25:50] <wikibugs>	 (03CR) 10Muehlenhoff: "Good point, maybe something along the lines of "unrestricted" instead of "public access" works better?" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[16:25:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage
[16:25:57] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1275460 (owner: 10Marostegui)
[16:26:00] <logmsgbot>	 !log marostegui@dns1004 START - running authdns-update
[16:26:50] <marostegui>	 !log Switchover m3 proxy (phabricator)
[16:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:02] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:27:27] <logmsgbot>	 !log marostegui@dns1004 END - running authdns-update
[16:28:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ganeti5006 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275461 (https://phabricator.wikimedia.org/T421863)
[16:28:51] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] firewall::service: Add a new parameter public_access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff)
[16:29:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM backupmon1001.eqiad.wmnet
[16:29:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet
[16:29:36] <wikibugs>	 (03CR) 10Dzahn: gerrit: update sync-instances cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb)
[16:29:49] <wikibugs>	 (03PS2) 10Herron: kafka-logging: update kafka-logging2001 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1273863 (https://phabricator.wikimedia.org/T423723)
[16:29:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840238 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None
[16:32:40] <wikibugs>	 (03PS2) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929)
[16:33:09] <wikibugs>	 (03CR) 10Elukey: ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:33:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet
[16:34:02] <wikibugs>	 (03PS1) 10RLazarus: mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311)
[16:34:06] <wikibugs>	 (03PS3) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929)
[16:34:07] <wikibugs>	 (03PS1) 10RLazarus: mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311)
[16:34:07] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91227 and previous config saved to /var/cache/conftool/dbconfig/20260420-163407-fceratto.json
[16:34:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:34:17] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet
[16:35:17] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840327 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None
[16:35:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11840329 (10herron) >>! In T423723#11837649, @elukey wrote: > @herron I would change a thing - I think it is sufficient to u...
[16:35:50] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:36:24] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2036: Moving to another rack
[16:36:30] <wikibugs>	 (03CR) 10Herron: [C:03+2] kafka-logging: update kafka-logging2001 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1273863 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron)
[16:36:36] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] ci::docker: only install docker-cli if on trixie or newer [puppet] - 10https://gerrit.wikimedia.org/r/1274067 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn)
[16:37:05] <wikibugs>	 (03PS1) 10RLazarus: mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275467 (https://phabricator.wikimedia.org/T423311)
[16:37:21] <wikibugs>	 (03PS4) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929)
[16:37:28] <herron>	 mutante: shall I go ahead and multiple?
[16:38:02] <mutante>	 herron: yes, multiple is fine. thanks!
[16:38:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet
[16:38:59] <herron>	 mutante: annd done!
[16:39:23] <mutante>	 ty
[16:40:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:41:16] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11840357 (10MoritzMuehlenhoff)
[16:41:53] <wikibugs>	 (03CR) 10Elukey: "spicerack/hosts.py: note: In member "ipmi" of class "Host":" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:42:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11840364 (10herron)
[16:44:16] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91229 and previous config saved to /var/cache/conftool/dbconfig/20260420-164415-fceratto.json
[16:44:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet
[16:45:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840376 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None
[16:47:19] <wikibugs>	 (03CR) 10JHathaway: ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:48:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet
[16:48:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5006.eqsin.wmnet with OS bookworm
[16:48:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840395 (10MoritzMuehlenhoff)
[16:48:51] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11840396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5006.eqsin.wmnet with OS bookworm completed: - ganeti5...
[16:52:36] <moritzm>	 !log installing imagemagick security updates
[16:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:23] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:54:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91230 and previous config saved to /var/cache/conftool/dbconfig/20260420-165423-fceratto.json
[16:54:28] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:54:42] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[16:54:51] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:55:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P91231 and previous config saved to /var/cache/conftool/dbconfig/20260420-165459-fceratto.json
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1700)
[17:00:05] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1700).
[17:00:06] <wikibugs>	 (03PS1) 10Bking: cloudelastic1012: move back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1275473 (https://phabricator.wikimedia.org/T422860)
[17:00:53] <wikibugs>	 (03CR) 10Bking: [C:03+2] cloudelastic1012: move back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1275473 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[17:02:50] <swfrench-wmf>	 I'll likely be deploying some non-mediawiki changes during the infra window (need a couple of minutes to double check some unrelated diffs)
[17:02:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie
[17:05:55] <wikibugs>	 (03PS5) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929)
[17:08:32] <wikibugs>	 06SRE: wiki.openstreetmap.org Commons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570#11840468 (10jcrespo) Let me ask, while all data I have access to is already anonymous, it is still user's private data, just osm wiki is the referrer. Let me ask what parts (if any) I can disclose for peopl...
[17:09:06] <wikibugs>	 (03CR) 10Scott French: [C:03+2] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449 (owner: 10Scott French)
[17:10:26] <wikibugs>	 (03PS6) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929)
[17:11:11] <wikibugs>	 (03CR) 10Elukey: "@" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[17:11:29] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449 (owner: 10Scott French)
[17:14:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840527 (10phaultfinder)
[17:14:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[17:15:04] <wikibugs>	 (03PS1) 10Pmiazga: Attribution: Clean up API spec descriptions [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502)
[17:15:05] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1005.eqiad.wmnet with OS bullseye
[17:15:07] <wikibugs>	 (03PS1) 10Pmiazga: i18n: Use {{doc-markdown}} template in Attribution qqq.json [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502)
[17:15:09] <wikibugs>	 (03PS1) 10Pmiazga: Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477
[17:15:09] <wikibugs>	 (03PS1) 10Pmiazga: Attribution: Add localized texts for trending param [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541)
[17:15:13] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11840529 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1005.eqiad.wmnet with OS bullseye
[17:16:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply
[17:17:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[17:17:21] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[17:17:27] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[17:17:33] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[17:17:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[17:17:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[17:17:47] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[17:17:48] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:17:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga)
[17:18:02] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:18:03] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[17:18:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga)
[17:18:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[17:18:21] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[17:18:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[17:18:43] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[17:21:44] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply
[17:22:05] <wikibugs>	 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#11840551 (10Ladsgroup) >>! In T352744#9413282, @MoritzMuehlenhoff wrote: >>>! In T352744#9413140, @jhathaway wrote: >> wolfssl is packaged in Debian, so that may be a possible option longer term, https://...
[17:22:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[17:23:02] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[17:23:32] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[17:24:04] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[17:24:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[17:24:51] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:25:07] <wikibugs>	 (03PS1) 10Elukey: profile::pki::intermediates: add discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275479 (https://phabricator.wikimedia.org/T420993)
[17:25:10] <wikibugs>	 (03PS1) 10Elukey: role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993)
[17:25:14] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:25:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:25:40] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11840566 (10MatthewVernon)
[17:25:45] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[17:25:51] <wikibugs>	 (03CR) 10Elukey: "This patch also needs the corresponding secret for the private key." [puppet] - 10https://gerrit.wikimedia.org/r/1275479 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[17:26:15] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[17:26:22] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[17:26:47] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[17:27:37] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage
[17:27:43] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[17:28:30] <wikibugs>	 (03PS1) 10Elukey: Add fake private secrets for discovery2026 PKI intermediate [labs/private] - 10https://gerrit.wikimedia.org/r/1275481 (https://phabricator.wikimedia.org/T420993)
[17:28:56] <wikibugs>	 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#11840575 (10ssingh) >>! In T352744#11840551, @Ladsgroup wrote: >>>! In T352744#9413282, @MoritzMuehlenhoff wrote: >>>>! In T352744#9413140, @jhathaway wrote: >>> wolfssl is packaged in Debian, so that may...
[17:34:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage
[17:35:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS trixie
[17:35:54] <wikibugs>	 (03CR) 10Alex Paskulin: [C:03+1] Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga)
[17:36:37] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:37:11] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:38:06] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:41:49] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:42:36] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:42:37] <logmsgbot>	 elukey@cumin1003 provision (PID 3425810) is awaiting input
[17:43:08] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[17:43:45] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[17:44:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[17:44:30] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[17:44:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840614 (10phaultfinder)
[17:45:01] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:45:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:45:26] <wikibugs>	 (03PS1) 10Bking: cloudelastic1012: move back to production role [puppet] - 10https://gerrit.wikimedia.org/r/1275485 (https://phabricator.wikimedia.org/T422860)
[17:45:48] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[17:46:10] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[17:46:23] <wikibugs>	 (03CR) 10Bking: [C:03+2] cloudelastic1012: move back to production role [puppet] - 10https://gerrit.wikimedia.org/r/1275485 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[17:46:41] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[17:47:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[17:47:55] <icinga-wm>	 PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[17:48:14] <mutante>	 ^ known - maintenance in progress
[17:48:55] <icinga-wm>	 RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[17:51:50] <mutante>	 new jenkins hosts will take over soon but haven't just yet. WIP
[17:52:05] <wikibugs>	 (03CR) 10RLazarus: "James: Please review for "yep, we aren't expecting mw-mcrouter to have its own mcrouter on 127.0.0.1:11213 anymore."" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus)
[17:52:54] <wikibugs>	 (03CR) 10RLazarus: "James: Please review for whether this matches your expectations of what routes exist where (and the revised comment is up-to-date)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus)
[17:53:15] <wikibugs>	 (03CR) 10Herron: thanos/compact: avoid constant Puppet changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli)
[17:54:16] <wikibugs>	 (03CR) 10Herron: [C:03+2] pyrra: remove configuration for web interface [puppet] - 10https://gerrit.wikimedia.org/r/1270992 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron)
[17:56:12] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11840735 (10elukey) I was able to repro:  ` 2026-04-20 17:39:37,345 elukey 3425810 [DEBUG wmflib.interactive:229 in confirm_on_failure] Traceback Traceback (most recent call la...
[17:56:45] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[17:56:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1005.eqiad.wmnet with OS bullseye
[17:56:58] <wikibugs>	 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11840743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1005.eqiad.wmnet with OS bullseye completed...
[17:59:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840747 (10phaultfinder)
[18:05:15] <wikibugs>	 (03CR) 10Herron: [C:03+2] pyrra: remove pyrra/slo/slos dns entries [dns] - 10https://gerrit.wikimedia.org/r/1270995 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron)
[18:05:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1272832 (owner: 10JHathaway)
[18:05:37] <logmsgbot>	 !log herron@dns1004 START - running authdns-update
[18:06:45] <jinxer-wm>	 RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[18:07:08] <logmsgbot>	 !log herron@dns1004 END - running authdns-update
[18:09:02] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11840814 (10herron)
[18:09:17] <wikibugs>	 (03PS2) 10Jforrester: Attribution: Clean up API spec descriptions [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[18:09:17] <wikibugs>	 (03PS2) 10Jforrester: i18n: Use {{doc-markdown}} template in Attribution qqq.json [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[18:09:17] <wikibugs>	 (03PS2) 10Jforrester: Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga)
[18:09:17] <wikibugs>	 (03PS2) 10Jforrester: Attribution: Add localized texts for trending param [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga)
[18:09:19] <wikibugs>	 (03PS1) 10Jforrester: Attribution: Update contact and add call to action [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275488 (https://phabricator.wikimedia.org/T422502)
[18:09:42] <wikibugs>	 (03CR) 10Herron: [C:03+2] "yes!" [puppet] - 10https://gerrit.wikimedia.org/r/1270974 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron)
[18:11:30] <Amir1>	 !log drop of langlinks table on testcommonswiki (T421914)
[18:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:34] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[18:12:58] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] ensure net.netfilter.nf_conntrack_max is updated [puppet] - 10https://gerrit.wikimedia.org/r/1272832 (owner: 10JHathaway)
[18:15:49] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[18:15:52] <wikibugs>	 (03PS1) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:16:15] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add ganeti5006 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275461 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[18:17:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:17:21] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] profile::pki::intermediates: add discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275479 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[18:19:04] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11840906 (10Ladsgroup) We need to find a different name for the mailing list. We are trying to standardize the mailing list names. See https://meta.wikimedia.org/wiki/Mailing_l...
[18:19:06] <wikibugs>	 (03CR) 10JHathaway: role::pki::multiroot: configure discovery2026 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[18:19:08] <wikibugs>	 (03PS2) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:19:15] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fbda72d1550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:19:21] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:19:36] <James_F>	 Heads-up: I'm going to backport an i18n change for MW-Interfaces, rather than have it swamp the normal window.
[18:19:41] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Add fake private secrets for discovery2026 PKI intermediate [labs/private] - 10https://gerrit.wikimedia.org/r/1275481 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[18:20:18] <wikibugs>	 (03PS3) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:20:21] <wikibugs>	 (03CR) 10Herron: [C:03+2] puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron)
[18:20:26] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:21:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[18:21:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[18:21:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga)
[18:21:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275488 (https://phabricator.wikimedia.org/T422502) (owner: 10Jforrester)
[18:21:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga)
[18:21:51] <wikibugs>	 (03PS4) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:23:10] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:24:01] <wikibugs>	 (03PS3) 10Herron: puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307)
[18:24:15] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f37550f9550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:24:37] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 353.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:24:59] <wikibugs>	 (03Merged) 10jenkins-bot: Attribution: Clean up API spec descriptions [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[18:25:01] <wikibugs>	 (03Merged) 10jenkins-bot: i18n: Use {{doc-markdown}} template in Attribution qqq.json [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga)
[18:25:03] <wikibugs>	 (03Merged) 10jenkins-bot: Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga)
[18:25:04] <wikibugs>	 (03Merged) 10jenkins-bot: Attribution: Update contact and add call to action [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275488 (https://phabricator.wikimedia.org/T422502) (owner: 10Jforrester)
[18:25:06] <wikibugs>	 (03Merged) 10jenkins-bot: Attribution: Add localized texts for trending param [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga)
[18:25:30] <wikibugs>	 (03CR) 10Herron: [C:03+2] puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron)
[18:25:31] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1275475|Attribution: Clean up API spec descriptions (T422502)]], [[gerrit:1275476|i18n: Use {{doc-markdown}} template in Attribution qqq.json (T422502)]], [[gerrit:1275477|Attribution: Documentation copyedits]], [[gerrit:1275488|Attribution: Update contact and add call to action (T422502)]], [[gerrit:1275478|Attribution: Add localized texts for trending param (T423541)]]
[18:25:37] <stashbot>	 T422502: Clean up Attribution API spec descriptions - https://phabricator.wikimedia.org/T422502
[18:25:37] <stashbot>	 T423541: 'trending' signal in the Attribution API is not returning the correct descriptions in the schema - https://phabricator.wikimedia.org/T423541
[18:28:00] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11840986 (10Luisfff2812) Hi @Ladsgroup thank you for your guidance! We plan to apply for User Group recognition starting next year, this year we are focused on strengthening ou...
[18:29:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:29:35] <wikibugs>	 (03PS5) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:29:50] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:29:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841002 (10phaultfinder)
[18:34:16] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11841011 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done! https://lists.wikimedia.org/postorius/lists/wikimedia-jujuy.lists.wikimedia.org
[18:36:06] <wikibugs>	 (03PS6) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:36:19] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11841016 (10Luisfff2812) Thank you so much, @Ladsgroup!!!
[18:37:58] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:41:07] <wikibugs>	 (03CR) 10Jforrester: "@zabe I cherry-picked this speculatively; do you think we should deploy it?" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1273787 (https://phabricator.wikimedia.org/T423654) (owner: 10Jforrester)
[18:42:20] <logmsgbot>	 !log jforrester@deploy1003 pmiazga, jforrester: Backport for [[gerrit:1275475|Attribution: Clean up API spec descriptions (T422502)]], [[gerrit:1275476|i18n: Use {{doc-markdown}} template in Attribution qqq.json (T422502)]], [[gerrit:1275477|Attribution: Documentation copyedits]], [[gerrit:1275488|Attribution: Update contact and add call to action (T422502)]], [[gerrit:1275478|Attribution: Add localized texts for trending param (T423541)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:42:31] <stashbot>	 T422502: Clean up Attribution API spec descriptions - https://phabricator.wikimedia.org/T422502
[18:42:31] <stashbot>	 T423541: 'trending' signal in the Attribution API is not returning the correct descriptions in the schema - https://phabricator.wikimedia.org/T423541
[18:42:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11841044 (10Jdforrester-WMF) 05In progress→03Resolved OK, this should now be Resolved. Hopefully.
[18:43:25] <wikibugs>	 (03PS7) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:43:36] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:44:16] <logmsgbot>	 !log jforrester@deploy1003 pmiazga, jforrester: Continuing with sync
[18:47:00] <wikibugs>	 10SRE-Access-Requests: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943 (10jasmine_) 03NEW
[18:49:21] <wikibugs>	 (03PS8) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:50:57] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[18:52:46] <wikibugs>	 (03PS3) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275430
[18:53:23] <dancy>	 James_F: I'd like to deploy a scap update when you're done.
[18:54:17] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:54:37] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus)
[18:55:14] <wikibugs>	 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841090 (10herron)
[18:55:23] <wikibugs>	 (03PS9) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[18:55:54] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275475|Attribution: Clean up API spec descriptions (T422502)]], [[gerrit:1275476|i18n: Use {{doc-markdown}} template in Attribution qqq.json (T422502)]], [[gerrit:1275477|Attribution: Documentation copyedits]], [[gerrit:1275488|Attribution: Update contact and add call to action (T422502)]], [[gerrit:1275478|Attribution: Add localized texts for trending param (T423541)]] (duration: 30m 23s)
[18:55:56] <James_F>	 dancy: Absolutely; should be done now.
[18:55:58] <stashbot>	 T422502: Clean up Attribution API spec descriptions - https://phabricator.wikimedia.org/T422502
[18:55:58] <stashbot>	 T423541: 'trending' signal in the Attribution API is not returning the correct descriptions in the schema - https://phabricator.wikimedia.org/T423541
[18:56:08] <dancy>	 Thanks! 
[18:56:24] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.249.0" for 2 host(s)
[18:56:28] <wikibugs>	 (03PS4) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275430
[18:57:14] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (owner: 10Effie Mouzeli)
[18:58:18] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.249.0" completed for 2 hosts
[18:58:47] <dancy>	 I'm done.
[18:59:17] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:59:21] <wikibugs>	 (03PS2) 10Elukey: role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993)
[18:59:34] <wikibugs>	 (03CR) 10Elukey: role::pki::multiroot: configure discovery2026 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[19:00:00] <wikibugs>	 (03PS10) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:00:09] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:00:29] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2190: Security update
[19:00:45] <wikibugs>	 (03PS1) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275497
[19:02:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good. There is still one ferm service to port to firewall::service before the rdb* hosts are ready (redis_master_role), but easy eno" [puppet] - 10https://gerrit.wikimedia.org/r/1275497 (owner: 10Effie Mouzeli)
[19:03:46] <wikibugs>	 (03PS2) 10Scott French: P:mediawiki::php: Support component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964)
[19:04:14] <wikibugs>	 (03PS2) 10Scott French: hieradata: Switch deployment hosts to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275492 (https://phabricator.wikimedia.org/T422964)
[19:04:16] <wikibugs>	 (03PS2) 10Scott French: hieradata: Switch parsoidtest1001 to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275493 (https://phabricator.wikimedia.org/T422964)
[19:04:21] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:05:10] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[19:05:29] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] role::pki::multiroot: configure discovery2026 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[19:07:15] <wikibugs>	 (03PS11) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:07:51] <wikibugs>	 (03PS1) 10Jasmine: admin: add spare FIDO backed key [Jasmine] [puppet] - 10https://gerrit.wikimedia.org/r/1275501 (https://phabricator.wikimedia.org/T423943)
[19:08:21] <wikibugs>	 (03PS1) 10Effie Mouzeli: (DNM) site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261)
[19:11:26] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:12:22] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "Verified OOB" [puppet] - 10https://gerrit.wikimedia.org/r/1275501 (https://phabricator.wikimedia.org/T423943) (owner: 10Jasmine)
[19:12:41] <wikibugs>	 (03CR) 10Jforrester: "See my existing patch, though we can use this one instead if you prefer. But let's follow MSB naming here (so evaluator-rust-javascript no" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[19:14:25] <wikibugs>	 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841207 (10herron)
[19:16:14] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] hieradata: Switch deployment hosts to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275492 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French)
[19:16:59] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French)
[19:17:10] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] P:mediawiki::php: Support component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French)
[19:19:38] <wikibugs>	 (03PS12) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:19:42] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:21:54] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] hieradata: Switch parsoidtest1001 to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275493 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French)
[19:22:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11841270 (10wiki_willy) @Jclark-ctr & @VRiley-WMF - can you provide a status on this one?
[19:25:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11841294 (10elukey) @ayounsi I think that iDRAC 10 hosts don't support the new LLDP code :( T418899#11840735
[19:27:30] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367)
[19:28:46] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:29:01] <wikibugs>	 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841312 (10herron)
[19:29:04] <wikibugs>	 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841313 (10herron) 05Open→03Resolved a:03herron
[19:29:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11841318 (10ssingh)
[19:30:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11841319 (10ssingh) Clarified the scope of work, "Set up lvs1017 with new NIC" is DC Ops and then Traffic is responsible for the other bits in the task ("Promote lvs1017").
[19:33:56] <wikibugs>	 (03PS13) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:33:56] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudinfra hiera: remove obsolete hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1275511
[19:34:07] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:34:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:37:55] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus)
[19:37:57] <wikibugs>	 (03PS14) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:38:02] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:38:16] <wikibugs>	 (03CR) 10Elukey: "Not pretty I know, but I haven't found a good solution yet :(" [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) (owner: 10Elukey)
[19:43:38] <wikibugs>	 (03PS15) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:43:48] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:44:19] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] "and re-verified just for good measure!" [puppet] - 10https://gerrit.wikimedia.org/r/1275501 (https://phabricator.wikimedia.org/T423943) (owner: 10Jasmine)
[19:44:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841349 (10phaultfinder)
[19:45:04] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus)
[19:46:03] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2190: Security update
[19:47:53] <logmsgbot>	 elukey@cumin1003 provision (PID 3504517) is awaiting input
[19:48:16] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11841356 (10elukey) The other error seems to be:  ` Created attribute BIOS.Setup.1-1 -> UncoreFrequency (with Set On Import True) with value DynamicUFS `
[19:48:19] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:53:15] <wikibugs>	 (03PS16) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:53:41] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:53:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:53:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:57:30] <wikibugs>	 (03PS17) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[19:57:34] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[19:58:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T2000).
[20:00:05] <jouncebot>	 aude: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:23] <aude>	 hi
[20:00:53] <aude>	 looks like mine is the only patch so i can handle it
[20:02:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude)
[20:04:47] <wikibugs>	 (03Merged) 10jenkins-bot: Do not show donate button on affiliate wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude)
[20:05:05] <logmsgbot>	 !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1275428|Do not show donate button on affiliate wikis (T423876)]]
[20:05:10] <stashbot>	 T423876: Remove donate button on Vector 2022 from affiliate wikis - https://phabricator.wikimedia.org/T423876
[20:06:30] <wikibugs>	 (03PS18) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646)
[20:08:37] <logmsgbot>	 !log aude@deploy1003 aude: Backport for [[gerrit:1275428|Do not show donate button on affiliate wikis (T423876)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:10:15] <logmsgbot>	 !log aude@deploy1003 aude: Continuing with sync
[20:13:40] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423942#11841439 (10Aklapper) →14Duplicate dup:03T423943
[20:13:42] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943#11841441 (10Aklapper)
[20:13:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943#11841444 (10Aklapper) a:03jasmine_
[20:16:02] <logmsgbot>	 !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275428|Do not show donate button on affiliate wikis (T423876)]] (duration: 10m 57s)
[20:16:03] <wikibugs>	 (03CR) 10Andrew Bogott: "only 18 tries to get puppet reduce working :(" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[20:16:06] <stashbot>	 T423876: Remove donate button on Vector 2022 from affiliate wikis - https://phabricator.wikimedia.org/T423876
[20:19:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1632, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:39:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841520 (10phaultfinder)
[20:41:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11841521 (10Jhancock.wm) @elukey did we get anything back from SM on the ticket you opened for this one?
[20:45:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1651, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:48:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1533, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:52:08] <cscott>	 sorry, i'm late to the backport window. are backports still in progress?
[20:53:23] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:53:31] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[20:54:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841538 (phaultfinder)
[20:55:16] <wikibugs>	 (PS2) Ecarg: Wikifunctions: add helm values for function-evaluator in Rust [deployment-charts] - https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627)
[20:55:39] <wikibugs>	 (CR) Ecarg: "sry, do you have a link to that patch? I'm having trouble finding it" [deployment-charts] - https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: Ecarg)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T2100).
[21:02:15] <sbassett>	 Hey all - we have a couple of security patches to get out today...
[21:03:12] <wikibugs>	 (CR) Bking: [C:+2] opensearch on k8s: Activate semantic-search and ipoid in services proxy [puppet] - https://gerrit.wikimedia.org/r/1272909 (https://phabricator.wikimedia.org/T421293) (owner: Bking)
[21:22:33] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 7846.32 ms
[21:23:49] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Trove guest-agent: update postgresql and mariadb backup versions [puppet] - https://gerrit.wikimedia.org/r/1261579 (https://phabricator.wikimedia.org/T420737) (owner: Andrew Bogott)
[21:24:21] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[21:24:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841633 (phaultfinder)
[21:25:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:25:55] <maryum>	 preparing to run scap
[21:28:08] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:29:13] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11841638 (Jclark-ctr) @Jgreen I have not received any updates on mgmt usernames, but I have a feeling we will not be able to use “root” as the username on mgmt for Supermicro...
[21:31:00] <wikibugs>	 (PS1) Ryan Kemper: prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860)
[21:33:08] <maryum>	 Deployed security fix for T299359
[21:33:09] <wikibugs>	 (PS2) Ryan Kemper: prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860)
[21:33:11] <maryum>	 !log Deployed security fix for T299359
[21:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:14] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: Ryan Kemper)
[21:34:30] <rzl>	 maryum: if you don't mind pinging me whenever you're finished, I've got some stuff to go out, but no rush :)
[21:34:42] <maryum>	 yes about to run scap once more and then I'm done
[21:34:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841648 (phaultfinder)
[21:34:48] <rzl>	 rad
[21:35:25] <wikibugs>	 (PS3) Ryan Kemper: prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860)
[21:38:17] <wikibugs>	 (CR) Catrope: [C:+2] Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[21:39:08] <wikibugs>	 (Merged) jenkins-bot: Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[21:40:01] <wikibugs>	 (CR) Bking: [C:+1] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: Ryan Kemper)
[21:41:09] <wikibugs>	 (CR) Bking: [C:+1] "Confirmed working on cloudelastic1011 (bullseye/Python 3.9) and cloudelastic1012 (trixie/Python 3.12)" [puppet] - https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: Ryan Kemper)
[21:42:41] <maryum>	 rzl: finished with scap
[21:42:53] <maryum>	 !log Deployed security fix for T406954
[21:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:26] <wikibugs>	 (CR) Ryan Kemper: [C:+2] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: Ryan Kemper)
[21:44:38] <rzl>	 maryum: thanks!
[21:44:55] <wikibugs>	 (CR) RLazarus: [C:+2] mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: RLazarus)
[21:45:13] <wikibugs>	 (CR) RLazarus: [C:+2] mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: RLazarus)
[21:45:31] <wikibugs>	 (CR) RLazarus: mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: RLazarus)
[21:46:59] <zabe>	 "Warning: Undefined array key "default" in /srv/mediawiki-staging/wmf-config/CommonSettings-labs.php on line 576"
[21:47:16] <wikibugs>	 (Merged) jenkins-bot: mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: RLazarus)
[21:49:43] <Reedy>	 maryum: sbassett ^ I think the config patch has upset beta
[21:49:55] <maryum>	 I didn't deploy anything to config
[21:50:13] <maryum>	 reedy: only one to core and one to abuse filter
[21:50:25] <Reedy>	 Sure, but the patch has been merged, it will be deployed automatically to beta
[21:51:07] <maryum>	 reedy: wonder if I should revert both patches
[21:51:07] <zabe>	 Yeah the 'default' access here is wrong: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1272895/5/wmf-config/CommonSettings-labs.php
[21:51:10] <wikibugs>	 (CR) Reedy: Set CSP to enforce with currently-allow-listed domains on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[21:51:19] <rzl>	 I'm holding off, it's all yours if you need it
[21:51:43] <Reedy>	 Let me fix
[21:51:50] <maryum>	 reedy: oky
[21:52:20] <rzl>	 since I've merged my deployment-charts change, the helmfile diffs will come along when you run scap -- that's fine by me, I'll be here to monitor, but I can revert if you'd prefer to do one thing at a time
[21:52:46] <rzl>	 (or I can push mine out quickly and be out of your way)
[21:53:03] <wikibugs>	 (PS1) Reedy: CommonSettings-labs: Fix up CSP config [mediawiki-config] - https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612)
[21:53:53] <Reedy>	 rzl: Don't need to hold off for me
[21:54:10] <wikibugs>	 (CR) Reedy: [C:+2] CommonSettings-labs: Fix up CSP config [mediawiki-config] - https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612) (owner: Reedy)
[21:54:33] <rzl>	 Reedy: cool, starting a helmfile-only scap then
[21:55:03] <wikibugs>	 (PS1) Dzahn: jenkins: add firewall rule for new jenkins to gearman on legacy host [puppet] - https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521)
[21:56:04] <wikibugs>	 (CR) Reedy: Set CSP to enforce with currently-allow-listed domains on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[21:56:05] <wikibugs>	 (Merged) jenkins-bot: CommonSettings-labs: Fix up CSP config [mediawiki-config] - https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612) (owner: Reedy)
[21:57:03] <logmsgbot>	 !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1275463 T423311 T423624
[21:57:08] <stashbot>	 T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?) - https://phabricator.wikimedia.org/T423311
[21:57:09] <stashbot>	 T423624: Drop in-pod mcrouter from mw-wikifunctions pod, no longer used - https://phabricator.wikimedia.org/T423624
[21:57:26] <wikibugs>	 SRE, SRE-Access-Requests: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943#11841701 (jasmine_) Open→Resolved
[21:58:53] <wikibugs>	 (CR) SBassett: "Ugh, thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612) (owner: Reedy)
[21:58:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:11] <logmsgbot>	 !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1275463 T423311 T423624 (duration: 03m 24s)
[21:59:20] <wikibugs>	 (CR) SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[22:17:03] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1274387 (owner: C. Scott Ananian)
[22:21:51] <wikibugs>	 (PS1) C. Scott Ananian: Revert "Skin: Avoid stretching low resolution images" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524)
[22:22:11] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524) (owner: C. Scott Ananian)
[22:23:52] <wikibugs>	 (CR) SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[22:25:17] <wikibugs>	 (CR) Scott French: [C:+1] mwscript-k8s: add --output-file flag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1273905 (owner: CDanis)
[22:25:26] <wikibugs>	 (PS6) Jdlrobson: Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: Ignacio Rodríguez)
[22:25:45] <wikibugs>	 (PS2) C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a28 [vendor] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275541 (https://phabricator.wikimedia.org/T420102)
[22:26:18] <wikibugs>	 (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a28 [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275542 (https://phabricator.wikimedia.org/T423662)
[22:26:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275542 (https://phabricator.wikimedia.org/T423662) (owner: C. Scott Ananian)
[22:26:58] <wikibugs>	 (CR) Reedy: Set CSP to enforce with currently-allow-listed domains on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[22:29:40] <wikibugs>	 (CR) SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: SBassett)
[22:37:02] <wikibugs>	 (PS1) Jdlrobson: Don't set href for a link that has been unset [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275543 (https://phabricator.wikimedia.org/T422907)
[22:52:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2022.codfw.wmnet with OS trixie
[22:52:39] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11841905 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2022.codfw.wmnet with OS trixie
[22:52:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2023.codfw.wmnet with OS trixie
[22:53:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2024.codfw.wmnet with OS trixie
[22:53:12] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11841906 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2023.codfw.wmnet with OS trixie
[22:53:15] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11841907 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2024.codfw.wmnet with OS trixie
[22:58:31] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[22:59:04] <wikibugs>	 (CR) RLazarus: "Good idea! Comments on the implementation but no objections to doing it." [puppet] - https://gerrit.wikimedia.org/r/1273905 (owner: CDanis)
[23:00:04] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T2300)
[23:02:17] <Jdlrobson>	 starting some deploys shortly
[23:02:23] <Jdlrobson>	 let me know if any reason not to
[23:03:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524) (owner: C. Scott Ananian)
[23:06:39] <wikibugs>	 (PS1) Jdlrobson: [Mobile Page Previews] Avoid syntax error on older browsers [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959)
[23:06:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2022.codfw.wmnet with reason: host reimage
[23:07:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2023.codfw.wmnet with reason: host reimage
[23:07:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2024.codfw.wmnet with reason: host reimage
[23:07:58] <wikibugs>	 (Merged) jenkins-bot: Revert "Skin: Avoid stretching low resolution images" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524) (owner: C. Scott Ananian)
[23:10:39] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1275540|Revert "Skin: Avoid stretching low resolution images" (T421524 T423676)]]
[23:10:45] <stashbot>	 T421524: Small images are scaled up by thumbnail preference - https://phabricator.wikimedia.org/T421524
[23:10:45] <stashbot>	 T423676: Infobox images have huge padding in Firefox - https://phabricator.wikimedia.org/T423676
[23:12:06] <wikibugs>	 (CR) Eric Gardner: [C:+1] [Mobile Page Previews] Avoid syntax error on older browsers [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959) (owner: Jdlrobson)
[23:12:21] <logmsgbot>	 !log jdlrobson@deploy1003 cscott, jdlrobson: Backport for [[gerrit:1275540|Revert "Skin: Avoid stretching low resolution images" (T421524 T423676)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:12:47] <logmsgbot>	 !log jdlrobson@deploy1003 cscott, jdlrobson: Continuing with sync
[23:14:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2022.codfw.wmnet with reason: host reimage
[23:16:36] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275540|Revert "Skin: Avoid stretching low resolution images" (T421524 T423676)]] (duration: 05m 56s)
[23:16:41] <stashbot>	 T421524: Small images are scaled up by thumbnail preference - https://phabricator.wikimedia.org/T421524
[23:16:41] <stashbot>	 T423676: Infobox images have huge padding in Firefox - https://phabricator.wikimedia.org/T423676
[23:17:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959) (owner: Jdlrobson)
[23:19:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2023.codfw.wmnet with reason: host reimage
[23:19:58] <wikibugs>	 (Merged) jenkins-bot: [Mobile Page Previews] Avoid syntax error on older browsers [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959) (owner: Jdlrobson)
[23:20:12] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1275547|[Mobile Page Previews] Avoid syntax error on older browsers (T423959)]]
[23:20:24] <stashbot>	 T423959: Page Previews: Instrumentation code throws syntax errors in older browsers - https://phabricator.wikimedia.org/T423959
[23:21:48] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1275547|[Mobile Page Previews] Avoid syntax error on older browsers (T423959)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:24:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2024.codfw.wmnet with reason: host reimage
[23:24:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842040 (phaultfinder)
[23:24:38] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Continuing with sync
[23:28:25] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275547|[Mobile Page Previews] Avoid syntax error on older browsers (T423959)]] (duration: 08m 13s)
[23:28:29] <stashbot>	 T423959: Page Previews: Instrumentation code throws syntax errors in older browsers - https://phabricator.wikimedia.org/T423959
[23:29:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2022.codfw.wmnet with OS trixie
[23:29:21] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11842044 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2022.codfw.wmnet with OS trixie completed: - pc2022 (**WARN**)   - Dow...
[23:30:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: Ignacio Rodríguez)
[23:32:04] <wikibugs>	 (Merged) jenkins-bot: Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: Ignacio Rodríguez)
[23:32:21] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1271862|Restore PageImages functionality to Wikisources and Wikibooks (T417538)]]
[23:32:25] <stashbot>	 T417538: Enable PageImages by default for Wikisource and Wikibooks - https://phabricator.wikimedia.org/T417538
[23:34:01] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson, ignaciorodrguez: Backport for [[gerrit:1271862|Restore PageImages functionality to Wikisources and Wikibooks (T417538)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:34:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2023.codfw.wmnet with OS trixie
[23:36:24] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson, ignaciorodrguez: Continuing with sync
[23:39:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2024.codfw.wmnet with OS trixie
[23:39:20] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11842080 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2024.codfw.wmnet with OS trixie completed: - pc2024 (**WARN**)   - Dow...
[23:39:22] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1275554
[23:39:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1275554 (owner: TrainBranchBot)
[23:40:08] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271862|Restore PageImages functionality to Wikisources and Wikibooks (T417538)]] (duration: 07m 47s)
[23:40:17] <stashbot>	 T417538: Enable PageImages by default for Wikisource and Wikibooks - https://phabricator.wikimedia.org/T417538
[23:50:54] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1275554 (owner: TrainBranchBot)
[23:53:15] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11842127 (Jhancock.wm) Open→Resolved @Marostegui fixed it. but please reopen the ticket if anything seems off.
[23:54:02] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus