[00:24:36] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [00:24:40] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [00:33:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [00:34:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:45:45] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [00:45:45] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ... 
[00:45:45] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [02:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:35] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11837345 (10jhathaway) @MoritzMuehlenhoff I tried to reproduce the issue on Friday afternoon, but I was unable to trigger it with simulated loads via cumin. I rat... [02:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:23] FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:44:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:48:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [03:05:52] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [03:05:57] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [03:14:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:19:37] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [03:19:42] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [03:24:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:30:33] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:39:29] 06SRE, 06Traffic: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11837399 (10Naruse_shiroha) Any update on this after one month...? 
[04:10:33] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:56:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11837462 (10Marostegui) Amazing thank you! [05:08:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11837463 (10Marostegui) 05Resolved→03Open @Jhancock.wm unfortunately pc2022, pc2023 and pc2024 have the wrong RAID. They should have RAID10 but they have RAID 0 pc2021 is corre... [05:09:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11837466 (10Marostegui) @VRiley-WMF please note these hosts require RAID 10 (just saying because there was some config confusion in codfw and they ended up with RAID 0 instead).
[05:17:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:25:57] (03PS1) 10Marostegui: db2151: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275246 [05:27:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 34.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:31:56] (03CR) 10Marostegui: [C:03+2] db2151: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275246 (owner: 10Marostegui) [05:32:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2151.codfw.wmnet with reason: Reimage to Trixie [05:32:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2151: Reimage to Trixie [05:32:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2151: Reimage to Trixie [05:33:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2151.codfw.wmnet with OS trixie [05:47:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:53:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [05:59:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [06:06:20] !log Removed categorylinks_icu72 from s1 and s6 T422546 [06:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:24] T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546 [06:10:06] 
(03PS1) 10Marostegui: Revert "db2151: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275248 [06:11:44] (03CR) 10Marostegui: [C:03+2] Revert "db2151: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275248 (owner: 10Marostegui) [06:22:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2151.codfw.wmnet with OS trixie [06:22:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2151: after reimage to trixie [06:26:02] (03CR) 10Muehlenhoff: [C:03+2] Remove bast1003 from list of bastions [puppet] - 10https://gerrit.wikimedia.org/r/1273413 (owner: 10Muehlenhoff) [06:26:53] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2151: after reimage to trixie [06:26:59] (03CR) 10Arnaudb: gerrit: update sync-instances cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [06:27:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2151: repool after maintenance [06:30:56] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1275249 (https://phabricator.wikimedia.org/T423837) [06:35:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2214 with weight 0 T423837', diff saved to https://phabricator.wikimedia.org/P91127 and previous config saved to /var/cache/conftool/dbconfig/20260420-063553-marostegui.json [06:35:59] T423837: Switchover s6 master (db2229 -> db2214) - https://phabricator.wikimedia.org/T423837 [06:36:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 21 hosts with reason: Primary switchover s6 T423837 [06:36:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:36:44] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1275249 (https://phabricator.wikimedia.org/T423837) (owner: 10Gerrit maintenance bot) [06:39:43] !log Starting s6 codfw failover from db2229 to db2214 - T423837 [06:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2214 to s6 primary T423837', diff saved to https://phabricator.wikimedia.org/P91128 and previous config saved to /var/cache/conftool/dbconfig/20260420-064006-marostegui.json [06:40:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2229 T423837', diff saved to https://phabricator.wikimedia.org/P91129 and previous config saved to /var/cache/conftool/dbconfig/20260420-064042-marostegui.json [06:40:46] (03PS1) 10Muehlenhoff: Remove bast5004 from list of active bastions [puppet] - 10https://gerrit.wikimedia.org/r/1275250 (https://phabricator.wikimedia.org/T421863) [06:41:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:43:06] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:43:23] FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:45:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:48:26] (03PS1) 10Marostegui: db2229: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275252 [06:49:37] ACKNOWLEDGEMENT - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: Marostegui Host will be decommed https://wikitech.wikimedia.org/wiki/Orchestrator [06:49:40] (03CR) 10Marostegui: [C:03+2] db2229: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275252 (owner: 10Marostegui) [06:50:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:50:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2229.codfw.wmnet with reason: Reimage to Trixie [06:50:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2229: Reimage to Trixie [06:50:36] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, re: pontoon changes, they are always safe wrt breaking production (i.e. 
they can't)" [puppet] - 10https://gerrit.wikimedia.org/r/1273833 (owner: 10Andrew Bogott) [06:50:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2229: Reimage to Trixie [06:51:06] (03PS1) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) [06:52:06] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2229.codfw.wmnet with OS trixie [06:52:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:52:45] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#11837619 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [06:57:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:57:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2151: repool after maintenance [06:59:09] (03PS1) 10Marostegui: Revert "db2229: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275254 [07:00:05] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:07:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [07:07:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:07:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T419635)', diff saved to https://phabricator.wikimedia.org/P91132 and previous config saved to /var/cache/conftool/dbconfig/20260420-070728-fceratto.json [07:07:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:09:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T419635)', diff saved to https://phabricator.wikimedia.org/P91133 and previous config saved to /var/cache/conftool/dbconfig/20260420-070941-fceratto.json [07:10:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2229.codfw.wmnet with reason: host reimage [07:12:09] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11837649 (10elukey) @herron I would change a thing - I think it is sufficient to upgrade a single host (like https://gerrit.... 
[07:14:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: host reimage [07:15:27] (03CR) 10Slyngshede: [C:03+2] data: align config [puppet] - 10https://gerrit.wikimedia.org/r/1273658 (owner: 10Slyngshede) [07:16:38] (03CR) 10Filippo Giunchedi: [C:03+1] Switch Cloud VPS to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273441 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [07:19:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P91134 and previous config saved to /var/cache/conftool/dbconfig/20260420-071949-fceratto.json [07:20:35] (03CR) 10Ayounsi: [C:03+1] Remove bast5004 from list of active bastions [puppet] - 10https://gerrit.wikimedia.org/r/1275250 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:26:24] (03CR) 10Ayounsi: "Idea lgtm, can you just run PCC on a random host to make sure it's a real NOOP ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [07:27:17] (03CR) 10Marostegui: [C:03+2] Revert "db2229: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275254 (owner: 10Marostegui) [07:29:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P91135 and previous config saved to /var/cache/conftool/dbconfig/20260420-072957-fceratto.json [07:30:01] !log Removed categorylinks_icu72 from s12 T422546 [07:30:03] !log Removed categorylinks_icu72 from s2 T422546 [07:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:04] T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546 [07:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:15] !log Removed categorylinks_icu72 from s7 T422546 [07:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:45] (03CR) 10Muehlenhoff: [C:03+2] Remove bast5004 from list of active bastions [puppet] - 10https://gerrit.wikimedia.org/r/1275250 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:38:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2229.codfw.wmnet with OS trixie [07:40:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T419635)', diff saved to https://phabricator.wikimedia.org/P91136 and previous config saved to /var/cache/conftool/dbconfig/20260420-074005-fceratto.json [07:40:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:40:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:40:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T419635)', 
diff saved to https://phabricator.wikimedia.org/P91137 and previous config saved to /var/cache/conftool/dbconfig/20260420-074031-fceratto.json [07:41:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2229: after reimage to trixie [07:44:45] (03PS1) 10Muehlenhoff: Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673) [07:47:14] (03PS1) 10Muehlenhoff: Remove bast5004 [puppet] - 10https://gerrit.wikimedia.org/r/1275258 (https://phabricator.wikimedia.org/T421863) [07:47:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:48:27] (03CR) 10Ayounsi: [C:03+1] "Overall lgtm, thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [07:49:20] (03CR) 10Ayounsi: [C:03+1] Remove bast5004 [puppet] - 10https://gerrit.wikimedia.org/r/1275258 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:51:08] !log Removed categorylinks_icu72 from s5 T422546 [07:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:12] T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546 [07:52:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:55:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T419635)', diff saved to https://phabricator.wikimedia.org/P91139 and previous config saved to /var/cache/conftool/dbconfig/20260420-075524-fceratto.json [07:55:29] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:57:34] (03CR) 10Klausman: [C:03+1] istio: revisit Prometheus buckets for Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [07:59:02] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 12389 [07:59:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12389 [08:01:00] !log Removed categorylinks_icu72 from s3 with a sleep, this will take around 1.5 hours T422546 [08:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:06] T422546: Clean up after the ICU 72 upgrade - 
https://phabricator.wikimedia.org/T422546 [08:02:19] (03PS1) 10MVernon: apus: move eqiad controller moss-be1001 -> apus-be1005 [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) [08:04:34] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5004.wikimedia.org [08:05:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [08:05:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T419961)', diff saved to https://phabricator.wikimedia.org/P91141 and previous config saved to /var/cache/conftool/dbconfig/20260420-080529-fceratto.json [08:05:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P91142 and previous config saved to /var/cache/conftool/dbconfig/20260420-080539-fceratto.json [08:06:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:07:05] !log filippo@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM cloudcumin2001.codfw.wmnet [08:07:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11837790 (10MoritzMuehlenhoff) [08:07:27] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837791 (10ops-monitoring-bot) VM cloudcumin2001.codfw.wmnet rebooted by filippo@cumin1003 with reason: None [08:09:17] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:09:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:11:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:13:06] !log filippo@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cloudcumin2001.codfw.wmnet [08:14:02] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2188.codfw.wmnet [08:14:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419961)', diff saved to https://phabricator.wikimedia.org/P91144 and previous config saved to /var/cache/conftool/dbconfig/20260420-081416-fceratto.json [08:14:37] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2188.codfw.wmnet [08:14:46] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11837828 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker2188.codfw.wmnet completed: - wikikube-worker21... 
[08:15:16] jmm@cumin2002 decommission (PID 198689) is awaiting input [08:15:19] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on wikikube-worker2188.codfw.wmnet with reason: dcops intervention [08:15:24] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11837829 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=763d93ce-3c2a-432a-9965-5b1307189ea7) set by cgoubert@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their... [08:15:35] !log filippo@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM cloudcumin1001.eqiad.wmnet [08:15:42] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11837832 (10Clement_Goubert) Depooled and downtimed for 30 days, all yours. [08:15:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P91145 and previous config saved to /var/cache/conftool/dbconfig/20260420-081547-fceratto.json [08:15:56] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837833 (10ops-monitoring-bot) VM cloudcumin1001.eqiad.wmnet rebooted by filippo@cumin1003 with reason: None [08:17:16] (03PS1) 10Ayounsi: Comment out eqsin Atlas Anchor [puppet] - 10https://gerrit.wikimedia.org/r/1275261 (https://phabricator.wikimedia.org/T421863) [08:17:43] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837840 (10fgiunchedi) [08:19:31] !log filippo@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cloudcumin1001.eqiad.wmnet [08:22:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1275261 
(https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:22:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:22:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:22:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:22:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5004.wikimedia.org [08:23:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11837847 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast5004.wikimedia.org` - bast5004.wikimedia.org (**PASS**)... 
[08:23:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb1015.eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: Reimage to Trixie [08:24:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P91146 and previous config saved to /var/cache/conftool/dbconfig/20260420-082424-fceratto.json [08:24:26] (03PS1) 10Marostegui: db1165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275262 [08:25:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T419635)', diff saved to https://phabricator.wikimedia.org/P91147 and previous config saved to /var/cache/conftool/dbconfig/20260420-082555-fceratto.json [08:26:00] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:26:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:26:26] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2004.codfw.wmnet [08:26:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2229: after reimage to trixie [08:26:35] (03CR) 10Ayounsi: [C:03+2] Comment out eqsin Atlas Anchor [puppet] - 10https://gerrit.wikimedia.org/r/1275261 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:26:45] (03CR) 10Marostegui: [C:03+2] db1165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275262 (owner: 10Marostegui) [08:27:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1165.eqiad.wmnet with reason: Reimage to Trixie [08:27:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1165: Reimage to Trixie [08:27:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool 
(exit_code=0) depool db1165: Reimage to Trixie [08:28:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1165.eqiad.wmnet with OS trixie [08:30:25] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:30:41] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts atlas5001.wikimedia.org [08:30:48] (03CR) 10Federico Ceratto: "The change is also updating the regex "node /^apus-be100[46789]\.eqiad\./ {" making it more selective, is it intended?" [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [08:31:45] (03CR) 10Muehlenhoff: [C:03+2] Remove bast5004 [puppet] - 10https://gerrit.wikimedia.org/r/1275258 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [08:32:16] (03PS1) 10Marostegui: eqiad.yaml: Add clouddb1024 [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) [08:32:26] (03CR) 10MVernon: "Yes - it removes apus-be1005 from it (otherwise it would be matched twice)." [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [08:32:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:32:48] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2004.codfw.wmnet [08:32:55] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837877 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2004.codfw.wmnet` - testvm2004.codfw.wmnet (... [08:33:28] (03CR) 10Marostegui: "s4 should be ready to start getting in the LB. s6 would be ready tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [08:34:17] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P91150 and previous config saved to /var/cache/conftool/dbconfig/20260420-083432-fceratto.json [08:34:47] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [08:36:27] (03CR) 10Ayounsi: [C:03+1] Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673) (owner: 10Muehlenhoff) [08:37:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837939 (10MoritzMuehlenhoff) [08:38:08] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:39:00] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:39:17] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:39:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2005.codfw.wmnet [08:39:49] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [08:39:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T419635)', diff saved to https://phabricator.wikimedia.org/P91151 and previous config saved to /var/cache/conftool/dbconfig/20260420-083957-fceratto.json [08:40:01] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:40:03] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11837945 (10ops-monitoring-bot) VM testvm2005.codfw.wmnet rebooted by jmm@cumin2002 with reason: None [08:40:29] (03PS1) 10Elukey: profile::cumin: update insetup_role_report.py [puppet] - 10https://gerrit.wikimedia.org/r/1275345 [08:41:05] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [08:41:44] (03CR) 10MVernon: "(to address the lack of apus100[1,2] in the regex - we never used those hostnames (nor will we), because they were called moss-be100[1,2])" [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [08:41:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
atlas5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [08:41:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:41:49] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts atlas5001.wikimedia.org [08:41:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11837951 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1003 for hosts: `atlas5001.wikimedia.org` - atlas5001.wikimedia.org (**WARN**) - //Host not f... [08:41:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1165.eqiad.wmnet with reason: host reimage [08:42:01] (03PS1) 10Marostegui: Revert "db1165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275348 [08:42:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419635)', diff saved to https://phabricator.wikimedia.org/P91152 and previous config saved to /var/cache/conftool/dbconfig/20260420-084209-fceratto.json [08:42:48] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#11837956 (10MoritzMuehlenhoff) [08:43:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2005.codfw.wmnet [08:44:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:44:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419961)', diff saved to https://phabricator.wikimedia.org/P91153 and previous config saved to /var/cache/conftool/dbconfig/20260420-084440-fceratto.json 
[08:45:04] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [08:45:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91154 and previous config saved to /var/cache/conftool/dbconfig/20260420-084512-fceratto.json [08:47:49] (03CR) 10Blake: "Ah, I meant to rewrite those after adding the runbook, thanks for the catch. I've updated them to instead state impact, rather than sugges" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [08:48:07] (03PS16) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) [08:48:29] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet [08:48:37] (03CR) 10FNegri: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [08:49:17] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11837972 (10AnnieKim_WMDE) Uploaded my ssh public key, waiting to be added to groups. 
[08:49:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1165.eqiad.wmnet with reason: host reimage [08:50:56] (03CR) 10Marostegui: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [08:51:57] (03PS2) 10Muehlenhoff: Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673) [08:52:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91155 and previous config saved to /var/cache/conftool/dbconfig/20260420-085217-fceratto.json [08:53:08] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:53:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91156 and previous config saved to /var/cache/conftool/dbconfig/20260420-085349-fceratto.json [08:56:54] (03CR) 10Muehlenhoff: [C:03+2] Update IP resolve spec test to use bast1004 instead of bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/1275257 (https://phabricator.wikimedia.org/T423673) (owner: 10Muehlenhoff) [08:59:12] (03CR) 10Kosta Harlan: [C:03+1] maintain-views: Hide blocks with bl_deleted set to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1273781 (https://phabricator.wikimedia.org/T414188) (owner: 10Dreamy Jazz) [08:59:36] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838027 (10DPogorzelski-WMF) 05Open→03Resolved [09:01:33] (03CR) 10Marostegui: [C:03+2] Revert "db1165: 
Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275348 (owner: 10Marostegui) [09:01:35] (03CR) 10Ayounsi: [C:03+1] "perfecto, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [09:01:45] 06SRE, 10Lift-Wing, 06Machine-Learning-Team: Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838029 (10DPogorzelski-WMF) [09:02:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11838031 (10ayounsi) [09:02:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91157 and previous config saved to /var/cache/conftool/dbconfig/20260420-090225-fceratto.json [09:04:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P91158 and previous config saved to /var/cache/conftool/dbconfig/20260420-090401-fceratto.json [09:07:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2007.codfw.wmnet [09:07:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:08:26] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838039 (10ops-monitoring-bot) VM testvm2007.codfw.wmnet rebooted by jmm@cumin2002 with reason: None [09:10:57] !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:11:01] !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:11:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2007.codfw.wmnet [09:12:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419635)', diff saved to https://phabricator.wikimedia.org/P91159 and 
previous config saved to /var/cache/conftool/dbconfig/20260420-091233-fceratto.json [09:12:37] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:13:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [09:13:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T419635)', diff saved to https://phabricator.wikimedia.org/P91160 and previous config saved to /var/cache/conftool/dbconfig/20260420-091310-fceratto.json [09:13:19] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838048 (10isarantopoulos) [09:13:46] jmm@cumin2002 decommission (PID 226183) is awaiting input [09:13:49] !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:13:53] !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:14:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1165.eqiad.wmnet with OS trixie [09:14:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P91161 and previous config saved to /var/cache/conftool/dbconfig/20260420-091409-fceratto.json [09:15:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419635)', diff saved to https://phabricator.wikimedia.org/P91162 and previous config saved to /var/cache/conftool/dbconfig/20260420-091522-fceratto.json [09:16:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1165: after reimage to trixie [09:18:09] (03CR) 10Tiziano Fogli: istio: revisit Prometheus buckets for Wikikube (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 
10Elukey) [09:18:24] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [09:19:13] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:21:15] !nowandnext [09:21:24] jouncebot: nowandnext [09:21:25] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [09:21:25] In 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1000) [09:21:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2008.wikimedia.org [09:21:37] this is getting embarrassing xD [09:21:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [09:21:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838081 (10ops-monitoring-bot) VM testvm2008.wikimedia.org rebooted by jmm@cumin2002 with reason: None [09:22:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:22:25] (03CR) 10JMeybohm: [C:03+1] "Great, thanks! I'll add this to the K8s SIG agenda so the other cluster maintainers can decide whether they would like to route their aler" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [09:23:08] (03CR) 10Blake: [C:03+2] kubernetes-generic: Add alerts for BGP failure scenarios. 
[alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [09:24:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:24:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:24:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2006.codfw.wmnet [09:24:17] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838086 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (... [09:24:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419961)', diff saved to https://phabricator.wikimedia.org/P91164 and previous config saved to /var/cache/conftool/dbconfig/20260420-092417-fceratto.json [09:24:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [09:24:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T419961)', diff saved to https://phabricator.wikimedia.org/P91165 and previous config saved to /var/cache/conftool/dbconfig/20260420-092448-fceratto.json [09:24:52] (03Merged) 10jenkins-bot: kubernetes-generic: Add alerts for BGP failure scenarios. 
[alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [09:25:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [09:25:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2008.wikimedia.org [09:25:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91166 and previous config saved to /var/cache/conftool/dbconfig/20260420-092530-fceratto.json [09:25:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11838091 (10ops-monitoring-bot) Draining ganeti5006.eqsin.wmnet of running VMs [09:26:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [09:26:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [09:26:54] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [09:26:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11838092 (10ops-monitoring-bot) Draining ganeti5006.eqsin.wmnet of running VMs [09:27:40] (03PS1) 10Dpogorzelski: ml-serve: remove excludeIPRanges from cni config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722) [09:28:01] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838096 (10MoritzMuehlenhoff) [09:29:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard2003.codfw.wmnet [09:29:46] (03CR) 10MVernon: [C:03+2] apus: move eqiad controller 
moss-be1001 -> apus-be1005 [puppet] - 10https://gerrit.wikimedia.org/r/1275260 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [09:29:59] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838115 (10ops-monitoring-bot) VM puppetboard2003.codfw.wmnet rebooted by jmm@cumin2002 with reason: None [09:32:57] (03PS1) 10Klausman: ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356 [09:33:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard2003.codfw.wmnet [09:33:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419961)', diff saved to https://phabricator.wikimedia.org/P91168 and previous config saved to /var/cache/conftool/dbconfig/20260420-093337-fceratto.json [09:33:51] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:34:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:01] (03CR) 10Arthur taylor: Enable and configure WikiProjects prototype on WikiData beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven) [09:35:05] !log kamila@deploy1003 Started scap sync-world: ICU 72 upgrade [09:35:06] !ack [09:35:07] !ack [09:35:07] 7855 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service 
http_thanos-query_ip4 eqiad) [09:35:08] All incidents are already acked. [09:35:08] (03PS1) 10Phuedx: PHP SDK: Split measurement of unknown experiments [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112) [09:35:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [09:35:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91169 and previous config saved to /var/cache/conftool/dbconfig/20260420-093538-fceratto.json [09:36:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard1003.eqiad.wmnet [09:36:35] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:36:51] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:36:56] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838189 (10ops-monitoring-bot) VM puppetboard1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None [09:37:20] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356 (owner: 10Klausman) [09:38:01] marostegui: do you happen to have any idea how long we ought to wait to see if this resolves itself? 
[09:38:01] starting ICU 72 upgrade, a bit early so I have enough time to test [09:38:08] FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:20] (03CR) 10Klausman: [C:03+2] ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356 (owner: 10Klausman) [09:38:21] bjensen: I don't think we have to assume things would resolve on their own, check -sre [09:38:27] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.5 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272777 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [09:38:41] ah, i was reading from https://wikitech.wikimedia.org/wiki/Thanos#Service_thanos-query:443_has_failed_probes [09:39:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:19] (03Merged) 10jenkins-bot: ml-services: Allow LLM workloads to work on ml-serve1013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275356 (owner: 10Klausman) [09:40:20] bjensen: I guess we got lucky this time, but in general I'd investigate [09:40:25] which is what I was doing :) [09:40:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard1003.eqiad.wmnet [09:40:50] hm, i would prefer then that the docs not say that the situation often self-resolves [09:41:01] (03Merged) 10jenkins-bot: mcrouter: update to 1.3.5 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272777 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [09:41:13] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator 
resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [09:41:28] imo the alerting threshold should be adjusted if there are times where this can fire and we might not need to look at it [09:41:40] bjensen: absolutely yeah [09:42:03] bjensen: maybe we need a task to re-evaluate thresholds there [09:42:07] tappof: would that make sense ^? [09:42:18] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [09:42:29] (03CR) 10CI reject: [V:04-1] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [09:42:44] (03PS3) 10Effie Mouzeli: mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) [09:43:00] !log ceph orch host drain moss-be1001 T418901 [09:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:03] T418901: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901 [09:43:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P91170 and previous config saved to /var/cache/conftool/dbconfig/20260420-094345-fceratto.json [09:44:24] (03PS2) 10Effie Mouzeli: (WIP) update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 [09:45:32] (03PS1) 10Klausman: ml-services: Addendum to "Allow LLM workloads to work on ml-serve1013" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275359 [09:45:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419635)', diff saved to https://phabricator.wikimedia.org/P91171 and previous config saved to /var/cache/conftool/dbconfig/20260420-094546-fceratto.json [09:45:51] T419635: Drop il_to 
column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:45:53] (03CR) 10Klausman: [V:03+2 C:03+2] ml-services: Addendum to "Allow LLM workloads to work on ml-serve1013" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275359 (owner: 10Klausman) [09:46:04] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [09:46:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T419635)', diff saved to https://phabricator.wikimedia.org/P91172 and previous config saved to /var/cache/conftool/dbconfig/20260420-094612-fceratto.json [09:46:35] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:46:53] (03CR) 10Muehlenhoff: [C:03+2] Switch Cloud VPS to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1273441 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:51] (03Merged) 10jenkins-bot: ml-services: Addendum to "Allow LLM workloads to work on ml-serve1013" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275359 (owner: 10Klausman) [09:48:20] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[09:48:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T419635)', diff saved to https://phabricator.wikimedia.org/P91174 and previous config saved to /var/cache/conftool/dbconfig/20260420-094823-fceratto.json [09:49:24] (03CR) 10FNegri: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [09:50:50] (03PS4) 10Effie Mouzeli: mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) [09:51:18] (03CR) 10Marostegui: eqiad.yaml: Add clouddb1024 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [09:51:27] (03CR) 10Effie Mouzeli: [C:03+1] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [09:51:39] (03PS1) 10Marostegui: Revert "site.pp: Move clouddb1024 to analytics" [puppet] - 10https://gerrit.wikimedia.org/r/1275364 [09:52:03] !log kamila@deploy1003 kamila: ICU 72 upgrade synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:53:35] (03PS2) 10Marostegui: eqiad.yaml: Add clouddb1025 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) [09:53:51] (03CR) 10Marostegui: [C:03+2] Revert "site.pp: Move clouddb1024 to analytics" [puppet] - 10https://gerrit.wikimedia.org/r/1275364 (owner: 10Marostegui) [09:53:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275345 (owner: 10Elukey) [09:53:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P91175 and previous config saved to /var/cache/conftool/dbconfig/20260420-095354-fceratto.json [09:55:17] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11838253 (10MoritzMuehlenhoff) [09:55:24] (03PS2) 10Marostegui: cloudb1025: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) [09:55:56] (03PS3) 10Marostegui: cloudb1025: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) [09:56:27] (03CR) 10Marostegui: "Reverted the site.pp patch and updated https://gerrit.wikimedia.org/r/1273785" [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [09:58:31] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast1003.wikimedia.org [09:58:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91176 and previous config saved to /var/cache/conftool/dbconfig/20260420-095831-fceratto.json [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1000) [10:02:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1165: after 
reimage to trixie [10:02:34] !log ceph orch host drain moss-be1002 T418901 [10:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:38] T418901: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901 [10:04:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419961)', diff saved to https://phabricator.wikimedia.org/P91178 and previous config saved to /var/cache/conftool/dbconfig/20260420-100402-fceratto.json [10:04:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [10:04:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T419961)', diff saved to https://phabricator.wikimedia.org/P91179 and previous config saved to /var/cache/conftool/dbconfig/20260420-100423-fceratto.json [10:06:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) (owner: 10D3r1ck01) [10:07:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:07:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:08:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91180 and previous config saved to /var/cache/conftool/dbconfig/20260420-100839-fceratto.json [10:08:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:10:24] (03PS3) 10Effie Mouzeli: mw-mcrouter: update mcrouter module to 1.3.5 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 (https://phabricator.wikimedia.org/T421360) [10:13:28] jmm@cumin2002 decommission (PID 272402) is awaiting input [10:14:23] !log kamila@deploy1003 kamila: Continuing with sync [10:14:42] (03PS1) 10MVernon: hiera: remove two old apus backends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1275366 (https://phabricator.wikimedia.org/T418901) [10:15:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:17:25] (03PS1) 10Muehlenhoff: Remove bast1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275367 (https://phabricator.wikimedia.org/T423673) [10:18:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T419635)', diff saved to https://phabricator.wikimedia.org/P91181 and previous config saved to /var/cache/conftool/dbconfig/20260420-101847-fceratto.json [10:18:54] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:18:59] jmm@cumin2002 decommission (PID 272402) is awaiting input [10:19:05] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [10:19:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T419635)', diff saved to https://phabricator.wikimedia.org/P91182 and previous config saved to /var/cache/conftool/dbconfig/20260420-101913-fceratto.json [10:21:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419635)', diff saved to https://phabricator.wikimedia.org/P91183 and previous config saved to /var/cache/conftool/dbconfig/20260420-102125-fceratto.json [10:24:05] (03CR) 10Muehlenhoff: [C:03+2] Remove bast1003 from site.pp [puppet] - 
10https://gerrit.wikimedia.org/r/1275367 (https://phabricator.wikimedia.org/T423673) (owner: 10Muehlenhoff) [10:24:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:24:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:24:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast1003.wikimedia.org [10:26:27] !log kamila@deploy1003 Finished scap sync-world: ICU 72 upgrade (duration: 51m 35s) [10:27:36] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11838300 (10VRiley-WMF) a:03VRiley-WMF [10:31:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91184 and previous config saved to /var/cache/conftool/dbconfig/20260420-103133-fceratto.json [10:31:59] marostegui: bjensen There could be another page, sorry.. While doing a test, I refreshed a dashboard and I think it's the "bad" one. 
[10:32:14] gotcha, thanks [10:32:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11838303 (10VRiley-WMF) [10:32:30] tappof: got it thanks [10:32:52] !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [10:32:56] !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [10:33:08] FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:33:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11838306 (10VRiley-WMF) @BTullis Hey Ben, we can replace this cable in order to clear up this error. Can y... [10:33:44] (03PS1) 10Muehlenhoff: Remove ganeti5006 from ganeti01 eqsin cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275369 (https://phabricator.wikimedia.org/T421863) [10:36:37] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission bast1003 - https://phabricator.wikimedia.org/T423673#11838327 (10MoritzMuehlenhoff) [10:38:08] FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:41:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91185 and previous config saved to /var/cache/conftool/dbconfig/20260420-104141-fceratto.json [10:45:43] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:47:21] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Confirmed unused in wmf.24:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01) [10:47:26] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:47:56] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275372 [10:49:43] (03Merged) 10jenkins-bot: mcrouter: update to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:50:21] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:50:21] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:51:08] (03CR) 10Ayounsi: [C:03+1] Remove ganeti5006 from ganeti01 eqsin cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275369 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [10:51:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419635)', diff saved to https://phabricator.wikimedia.org/P91186 and previous config saved to /var/cache/conftool/dbconfig/20260420-105148-fceratto.json [10:51:54] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:52:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [10:52:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P91187 and previous config saved to /var/cache/conftool/dbconfig/20260420-105213-fceratto.json [10:55:43] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:56:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11838392 (10VRiley-WMF) Understood, thank you for the heads up! 
@Marostegui [11:01:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11838401 (10VRiley-WMF) [11:06:54] (03PS2) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) [11:08:49] (03PS1) 10Effie Mouzeli: mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) [11:10:23] (03PS2) 10Effie Mouzeli: mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) [11:11:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:16:09] (03PS1) 10JMeybohm: Add wikikube-worker refreshes [puppet] - 10https://gerrit.wikimedia.org/r/1275377 (https://phabricator.wikimedia.org/T418257) [11:16:15] (03CR) 10Federico Ceratto: "I tested parsercache and worked, not tested depool yet but it's pretty much the same." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto) [11:16:49] (03PS3) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) [11:17:25] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1025.eqiad.wmnet,service=x4 [11:17:55] (03CR) 10FNegri: [C:03+1] cloudb1025: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [11:19:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:21:13] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance sretest2010:9100) - https://phabricator.wikimedia.org/T423856 (10LSobanski) 03NEW [11:24:45] (03CR) 10FNegri: [C:03+1] "I depooled x4 from clouddb1025, this can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [11:27:18] (03CR) 10FNegri: "@marostegui@wikimedia.org this was on hold because of the mariadb issue, but now we can merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri) [11:30:57] (03PS4) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) [11:31:16] (03CR) 10Effie Mouzeli: [C:03+1] Add wikikube-worker refreshes [puppet] - 10https://gerrit.wikimedia.org/r/1275377 (https://phabricator.wikimedia.org/T418257) (owner: 10JMeybohm) [11:33:24] (03CR) 10JMeybohm: [C:03+2] Add wikikube-worker refreshes [puppet] - 10https://gerrit.wikimedia.org/r/1275377 (https://phabricator.wikimedia.org/T418257) (owner: 10JMeybohm) [11:35:10] (03CR) 10Marostegui: [C:03+1] sre.mysql.pool: Handle private tasks exception [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto) [11:36:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:37:40] jouncebot: nowandnext [11:37:40] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [11:37:40] In 1 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1300) [11:41:21] (03PS5) 10Muehlenhoff: firewall::service: Add a new parameter public_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) [11:49:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [11:50:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:52:05] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275366 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [11:52:32] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P91188 and previous config saved to /var/cache/conftool/dbconfig/20260420-115231-fceratto.json [11:52:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:53:46] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.pool: Handle private tasks exception [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto) [11:55:05] (03CR) 10Marostegui: [C:03+2] eqiad.yaml: Add clouddb1025 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1275286 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [11:57:51] (03CR) 10Muehlenhoff: "Good thing you prodded me for that, there was actually more things to fix... PCC now added and looking fine." [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:59:50] (03CR) 10MVernon: [C:03+2] hiera: remove two old apus backends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1275366 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [12:02:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P91189 and previous config saved to /var/cache/conftool/dbconfig/20260420-120239-fceratto.json [12:05:06] (03PS1) 10MVernon: preseed: increase size of / for thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/1275384 (https://phabricator.wikimedia.org/T423690) [12:10:56] !log installing edk2 security updates [12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts moss-be[1001-1002].eqiad.wmnet [12:12:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff 
saved to https://phabricator.wikimedia.org/P91190 and previous config saved to /var/cache/conftool/dbconfig/20260420-121247-fceratto.json [12:14:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission bast1003 - https://phabricator.wikimedia.org/T423673#11838644 (10VRiley-WMF) a:03VRiley-WMF [12:15:36] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1328-1334,1360-1374].eqiad.wmnet [12:15:41] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1328-1334,1360-1374].eqiad.wmnet [12:16:31] !log remove ganeti5006 from eqsin01 Ganeti cluster (running classic Ganeti) T421863 [12:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:35] T421863: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863 [12:16:44] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1328-1334,1360-1374].eqiad.wmnet - https://phabricator.wikimedia.org/T423863 (10JMeybohm) 03NEW [12:17:15] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1328-1334,1360-1374].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11838659 (10JMeybohm) [12:17:30] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti5006 from ganeti01 eqsin cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275369 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [12:17:32] !log Deployed patch for T423821 [12:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:24] PROBLEM - ganeti-noded running on ganeti5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:19:24] PROBLEM - ganeti-confd running on ganeti5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:20:54] 10ops-eqiad, 06SRE, 
06DC-Ops, 10decommission-hardware: decommission bast1003 - https://phabricator.wikimedia.org/T423673#11838678 (10VRiley-WMF) 05Open→03Resolved [12:21:32] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11838680 (10JMeybohm) [12:22:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P91191 and previous config saved to /var/cache/conftool/dbconfig/20260420-122256-fceratto.json [12:23:03] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:23:08] FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [12:23:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T419635)', diff saved to https://phabricator.wikimedia.org/P91192 and previous config saved to /var/cache/conftool/dbconfig/20260420-122321-fceratto.json [12:25:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419635)', diff saved to https://phabricator.wikimedia.org/P91193 and previous config saved to /var/cache/conftool/dbconfig/20260420-122534-fceratto.json [12:25:53] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275393 [12:26:38] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [12:28:13] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host 
wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096,1098-1112,1166-1168].eqiad.wmnet [12:31:07] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-be[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [12:32:55] (03CR) 10Btullis: [C:03+2] maintain-views: Hide blocks with bl_deleted set to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1273781 (https://phabricator.wikimedia.org/T414188) (owner: 10Dreamy Jazz) [12:34:12] mvernon@cumin2002 decommission (PID 356509) is awaiting input [12:35:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91194 and previous config saved to /var/cache/conftool/dbconfig/20260420-123542-fceratto.json [12:35:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-be[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [12:35:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:35:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts moss-be[1001-1002].eqiad.wmnet [12:35:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11838718 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `moss-be[1001-1002].eqiad.wmnet` - moss-be1... 
[12:45:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91195 and previous config saved to /var/cache/conftool/dbconfig/20260420-124550-fceratto.json [12:53:08] FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:28] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11838769 (10jijiki) Thank you! [12:54:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:55:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419635)', diff saved to https://phabricator.wikimedia.org/P91196 and previous config saved to /var/cache/conftool/dbconfig/20260420-125559-fceratto.json [12:56:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:56:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [12:56:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1236 (T419635)', diff saved to https://phabricator.wikimedia.org/P91197 and previous config saved to /var/cache/conftool/dbconfig/20260420-125624-fceratto.json [12:58:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419635)', diff saved to https://phabricator.wikimedia.org/P91198 and previous config saved to 
/var/cache/conftool/dbconfig/20260420-125837-fceratto.json [12:58:48] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096,1098-1112,1166-1168].eqiad.wmnet [12:58:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11838793 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node... [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1300). Please do the needful. [13:00:05] xSavitar, aude, and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] hi [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:42] o/ [13:02:55] (03CR) 10Elukey: [C:03+1] mw-mcrouter: update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [13:04:54] Getting an error when trying to get a scap OTP [13:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:04:58] ssh: Could not resolve hostname bast1003.wikimedia.org: nodename nor servname provided, or not known [13:04:59] Connection closed by UNKNOWN port 65535 [13:05:08] Did anything change of recent? [13:05:24] xSavitar: https://lists.wikimedia.org/hyperkitty/list/ops@lists.wikimedia.org/thread/DQ7KFORXBZQX55NR23QHZDNFOSXETLQV/ [13:05:42] taavi, thanks, having a quick read now. [13:05:49] (03CR) 10Elukey: "Thinking out loud - would it be better to add one option at the time, incrementally? 
For example, we could start with the 10 timeouts unti" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [13:06:05] taavi, mailing list is private and I'm not subscribed :( [13:06:10] (03CR) 10Elukey: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275372 (owner: 10Muehlenhoff) [13:06:15] o/ [13:06:34] (03CR) 10Elukey: [C:03+2] profile::cumin: update insetup_role_report.py [puppet] - 10https://gerrit.wikimedia.org/r/1275345 (owner: 10Elukey) [13:06:43] Lucas_WMDE o/ [13:06:44] xSavitar: you need bast1004 [13:06:56] bast1003 was retired (there’s probably a phab task but it’s not linked in the email AFAITC) [13:06:58] I wanted to self service but I would need some help so that I setup bast1004 later [13:07:03] Thanks! [13:07:10] okay, I can deploy [13:07:18] Thank you very much! [13:07:30] It's a no-op config patch. The config setting should be unused now [13:07:31] might as well do the two config changes together, I think [13:07:37] Ack! [13:07:46] (FYI aude ^) [13:08:04] i'm ready [13:08:25] hm, scap complains about dependencies *looks* [13:08:35] “but the dependency is not present in recent train branch: wmf/1.46.0-wmf.23” [13:08:43] “This branch is a likely rollback target” not sure I disagree, it’s Monday [13:08:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91199 and previous config saved to /var/cache/conftool/dbconfig/20260420-130845-fceratto.json [13:08:47] not sure I *agree [13:09:36] (03CR) 10Elukey: [C:03+1] "I am ok to proceed, but was this tested in staging with a kill/start of a pod etc..? 
Just to be sure that we are not getting into some wei" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:09:54] yeah, nah, let’s deploy [13:09:58] o/ [13:10:02] Sorry I'm late [13:10:22] no problem, we’re starting with the config changes now [13:10:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01) [13:10:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273842 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:11:27] (03Merged) 10jenkins-bot: Remove unused JWT for bot password temporary config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01) [13:11:44] We need to update spider pig website [13:11:54] (03Merged) 10jenkins-bot: Enable ReadingLists beta feature for all Wikipedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1273842 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:12:01] It still references bast1003 to get an OTP [13:12:25] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1247960|Remove unused JWT for bot password temporary config (T422367 T415007)]], [[gerrit:1273842|Enable ReadingLists beta feature for all Wikipedia wikis (T420881)]] [13:12:34] T422367: Remove temporary JWT session configuration setting for BotPasswords - https://phabricator.wikimedia.org/T422367 [13:12:35] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [13:12:35] T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881 
[13:12:38] huh, ok [13:13:18] on which page exactly? [13:13:36] taavi: I can see it in otpHost in https://spiderpig.wikimedia.org/api/whoami [13:13:46] (I think that’s where web/src/components/LoginPage.vue then gets it from) [13:13:53] wait [13:13:55] deploy1003, not bast1003 [13:13:58] taavi, after one logs in [13:14:09] https://spiderpig.wikimedia.org/ (after logging in) [13:14:24] I'm filing a task about it now [13:14:28] xSavitar: are you sure it’s telling you which *bastion* to use? (I got confused between bast1003 and deploy1003 just now) [13:15:17] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:35] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude, d3r1ck01: Backport for [[gerrit:1247960|Remove unused JWT for bot password temporary config (T422367 T415007)]], [[gerrit:1273842|Enable ReadingLists beta feature for all Wikipedia wikis (T420881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:15:46] aude: please test :) [13:15:47] checking [13:15:49] xSavitar: anything to test? [13:15:51] Lucas_WMDE, ack! you're right. [13:15:55] Lucas_WMDE, nothing to test. [13:16:00] ok [13:16:41] looks good [13:16:52] taavi, I don't think I need to file it after all, this looks like my problem to resolve. Thanks! I believe deploy1003 should work. [13:16:59] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude, d3r1ck01: Continuing with sync [13:17:00] ok! [13:18:43] Lucas_WMDE, taavi, I was able to get the OTP (after adjusting SSH config). Works now, thanks to you both! 
[13:18:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91200 and previous config saved to /var/cache/conftool/dbconfig/20260420-131853-fceratto.json [13:19:35] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:19:39] xSavitar: okay, great! [13:19:50] FIRING: KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:20:46] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1247960|Remove unused JWT for bot password temporary config (T422367 T415007)]], [[gerrit:1273842|Enable ReadingLists beta feature for all Wikipedia wikis (T420881)]] (duration: 08m 21s) [13:20:49] (03CR) 10Marostegui: "Works for me!" [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri) [13:20:53] T422367: Remove temporary JWT session configuration setting for BotPasswords - https://phabricator.wikimedia.org/T422367 [13:20:53] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [13:20:54] T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881 [13:20:56] thanks Lucas_WMDE ! 
[13:20:58] (03CR) 10Bking: [C:03+2] opensearch on k8s: Add semantic-search and ipoid to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [13:21:02] phuedx: want to deploy your backport yourself? [13:21:06] (03Merged) 10jenkins-bot: PHP SDK: Split measurement of unknown experiments [extensions/TestKitchen] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275357 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:21:06] np aude :) [13:21:21] Can do [13:21:27] alright, go ahead :) [13:21:41] I suppose there is no train this week [13:21:52] Lucas_WMDE, thanks for helping aude and me deploy. I appreciate it. 🙏🏽 [13:22:02] huh, what’s up with the train? [13:22:05] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1275357|PHP SDK: Split measurement of unknown experiments (T422112)]] [13:22:09] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [13:22:12] oh, earth day https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar [13:22:52] Scrolling through https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0200, it looks like yes, no train this week. [13:22:57] WMF staff have a holiday on Wednesday (and I do not see the train on the calendar) [13:23:41] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1275357|PHP SDK: Split measurement of unknown experiments (T422112)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:23:45] thx [13:23:45] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [13:25:00] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1012:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [13:26:05] Quick spot check on enwiki and dewiki LGTM [13:26:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868 (10MatthewVernon) 03NEW [13:26:08] !log phuedx@deploy1003 phuedx: Continuing with sync [13:26:40] 10ops-codfw, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869 (10FCeratto-WMF) 03NEW [13:29:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419635)', diff saved to https://phabricator.wikimedia.org/P91202 and previous config saved to /var/cache/conftool/dbconfig/20260420-132901-fceratto.json [13:29:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:29:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1253.eqiad.wmnet with reason: Maintenance [13:29:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T419635)', diff saved to https://phabricator.wikimedia.org/P91203 and previous config saved to /var/cache/conftool/dbconfig/20260420-132926-fceratto.json [13:29:50] RESOLVED: KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:29:56] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275357|PHP SDK: Split measurement of unknown experiments (T422112)]] (duration: 07m 51s) [13:30:00] RESOLVED: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1012:0 has a BGP 
session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [13:30:00] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [13:30:05] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1014.eqiad.wmnet with reason: Decommissioning — T412830 [13:30:09] * phuedx watches logs [13:30:11] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [13:31:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T419635)', diff saved to https://phabricator.wikimedia.org/P91204 and previous config saved to /var/cache/conftool/dbconfig/20260420-133139-fceratto.json [13:32:34] decommissioning Cassandra, aqs1014 [a,b] — T412830 [13:32:37] !log decommissioning Cassandra, aqs1014 [a,b] — T412830 [13:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:27] (03PS2) 10Bking: opensearch on k8s: Activate semantic-search and ipoid in services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1272909 (https://phabricator.wikimedia.org/T421293) [13:33:35] Lucas_WMDE: The logs look good. I think that's the end of the window? [13:34:06] * Lucas_WMDE reloads the calendar [13:34:08] looks like it yeah [13:34:09] thanks! [13:34:16] !log UTC afternoon backport+config window done [13:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:33] Quick lunch! 
[13:35:46] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:37:38] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:38:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868#11839006 (10Jclark-ctr) a:03Jclark-ctr [13:40:19] (03PS1) 10Daniel Kinzler: api rate limits: use global apihighlimits-requestor group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275410 (https://phabricator.wikimedia.org/T419796) [13:41:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P91205 and previous config saved to /var/cache/conftool/dbconfig/20260420-134148-fceratto.json [13:41:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:41:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91206 and previous config saved to /var/cache/conftool/dbconfig/20260420-134158-fceratto.json [13:43:06] (03CR) 10Elukey: ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [13:43:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868#11839045 (10Jclark-ctr) [13:43:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-be100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T423868#11839046 (10Jclark-ctr) 05Open→03Resolved [13:43:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - 
https://phabricator.wikimedia.org/T422317#11839049 (10Jclark-ctr) a:05Jclark-ctr→03brouberol [13:44:16] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbstore1010 to eqiad - jclark@cumin1003" [13:44:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbstore1010 to eqiad - jclark@cumin1003" [13:44:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:35] 10ops-eqiad, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872 (10phaultfinder) 03NEW [13:45:02] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host dbstore1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:46:24] 10SRE-swift-storage, 06DBA, 10MediaWiki-File-management, 07Regression: Stuck-hidden file / Deleted file revisions displaying improperly - https://phabricator.wikimedia.org/T423065#11839078 (10Bugreporter) >>! In T423065#11837057, @Zabe wrote: > Should be working again. Following up in T423821. See als... 
[13:47:58] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:50:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91207 and previous config saved to /var/cache/conftool/dbconfig/20260420-135025-fceratto.json [13:50:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P91208 and previous config saved to /var/cache/conftool/dbconfig/20260420-135155-fceratto.json [13:52:07] (03PS3) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) [13:52:32] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839106 (10FCeratto-WMF) [13:52:36] jouncebot: nowandnext [13:52:36] For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1300) [13:52:36] In 0 hour(s) and 37 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1430) [13:52:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [13:52:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T410589)', diff saved to https://phabricator.wikimedia.org/P91209 and previous config saved to /var/cache/conftool/dbconfig/20260420-135255-ladsgroup.json [13:53:00] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:58:50] (03PS1) 10Marostegui: ms2: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275414 [13:59:08] !log marostegui@cumin1003 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Reimage to Trixie [13:59:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11839147 (10VRiley-WMF) [13:59:33] (03CR) 10Marostegui: [C:03+2] ms2: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275414 (owner: 10Marostegui) [14:00:00] (03PS1) 10Elukey: profile::pki::root_ca: create a new discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993) [14:00:02] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:00:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1151.eqiad.wmnet with reason: Reimage to Trixie [14:00:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1151: Reimage to Trixie [14:00:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [14:00:14] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:00:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:00:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1151: Reimage to Trixie [14:00:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P91211 and previous config saved to /var/cache/conftool/dbconfig/20260420-140033-fceratto.json [14:00:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1151.eqiad.wmnet with OS trixie [14:01:54] (03PS3) 10FNegri: conftool-data: move s3, x3 to new hosts (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1259113 
(https://phabricator.wikimedia.org/T409557) [14:02:01] !log upgrade envoyproxy, restbase — T419637 & T410975 [14:02:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T419635)', diff saved to https://phabricator.wikimedia.org/P91212 and previous config saved to /var/cache/conftool/dbconfig/20260420-140203-fceratto.json [14:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:10] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [14:02:12] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [14:02:20] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:02:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [14:02:28] (03CR) 10Majavah: [C:03+2] P:wmcs::striker: Remove separate monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/1270282 (owner: 10Majavah) [14:06:57] (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275384 (https://phabricator.wikimedia.org/T423690) (owner: 10MVernon) [14:07:38] (03CR) 10MVernon: [C:03+2] preseed: increase size of / for thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/1275384 (https://phabricator.wikimedia.org/T423690) (owner: 10MVernon) [14:09:18] 06SRE, 10envoy, 06ServiceOps new, 10ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11839189 (10Eevans) [14:10:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P91213 and previous config saved to /var/cache/conftool/dbconfig/20260420-141042-fceratto.json [14:14:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye [14:14:22] !log marostegui@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db1151.eqiad.wmnet with reason: host reimage [14:14:38] 06SRE, 10SRE-swift-storage, 06SRE Observability, 13Patch-For-Review: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839198 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2006.codfw.wmnet... [14:15:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host dbstore1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:18:45] (03PS1) 10Marostegui: Revert "ms2: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275423 [14:19:17] (03CR) 10FNegri: [C:03+2] "Rebased, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri) [14:19:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1151.eqiad.wmnet with reason: host reimage [14:19:50] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1012.eqiad.wmnet [14:20:02] 06SRE, 10observability: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T422816#11839245 (10ayounsi) [14:20:05] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance sretest2010:9100) - https://phabricator.wikimedia.org/T423856#11839247 (10jhathaway) p:05Triage→03Medium a:03jhathaway [14:20:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91214 and previous config saved to /var/cache/conftool/dbconfig/20260420-142050-fceratto.json [14:21:02] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#11839250 (10ayounsi) 
p:05Triage→03Medium [14:21:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:21:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91215 and previous config saved to /var/cache/conftool/dbconfig/20260420-142120-fceratto.json [14:21:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudelastic1012.eqiad.wmnet [14:21:52] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11839254 (10LSobanski) p:05Triage→03High [14:22:41] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:22:41] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:22:41] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:23:33] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11839263 (10Scott_French) [14:26:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11839267 (10Scott_French) @AnnieKim_WMDE - Please see https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process f... [14:26:52] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:26:55] jclark@cumin1003 provision (PID 3284747) is awaiting input [14:29:41] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11839301 (10BTullis) >>! In T348935#11834420, @BTullis wrote: > It's worth noting that... [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1430) [14:30:33] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839308 (10Jhancock.wm) @FCeratto-WMF got it to boot. powered off, drained the flea power, and reseated the cables to the backplane. This error could have been caused by a loose cable.... 
[14:30:42] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839309 (10Jhancock.wm) a:03Jhancock.wm [14:30:55] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11839313 (10Jclark-ctr) a:03Jclark-ctr ` ps1-c6-eqiad.mgmt.eqiad.wmnet #1: Phase, AA:L2-L3, Active Power; Value: 1662 (power) high: 1650 ` [14:32:41] 10ops-codfw, 06SRE, 10Data-Persistence-Misc, 06DC-Ops: db2201 broken DIMM - https://phabricator.wikimedia.org/T423184#11839328 (10Jhancock.wm) 05Open→03Declined [14:33:09] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839335 (10FCeratto-WMF) Thanks! [14:33:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839337 (10Jhancock.wm) [14:34:42] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839359 (10Marostegui) Would this need an IP change? It should be fairly easy to get this host depooled, when would you like to get it done? 
[14:35:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbstore1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:36:53] (03CR) 10Scott French: [C:03+1] backup: Ignore /srv/docker from srv-deployment backups, move cluster mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo) [14:36:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage [14:37:15] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dbstore1010.eqiad.wmnet with OS bookworm [14:37:16] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbstore1010.eqiad.wmnet with OS bookworm [14:38:04] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190.codfw.wmnet is not powering up - https://phabricator.wikimedia.org/T423869#11839381 (10Jhancock.wm) 05Open→03Resolved [14:38:52] (03PS1) 10Aude: Limit donate button to Wikipedia wikis (except Finnish) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) [14:40:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude) [14:40:39] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dbstore1010.eqiad.wmnet with OS bookworm [14:40:47] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839409 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host dbstore1010.eqiad.wmnet with OS bookworm [14:41:07] (03CR) 10Anne Tomasevich: [C:03+1] Limit donate button to Wikipedia wikis (except Finnish) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude) [14:41:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839411 (10Jclark-ctr) a:03Jclark-ctr [14:41:27] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839410 (10Jhancock.wm) It does not need an IP change. only needs a few values in netbox updated and running dns cookbook to catch changes. It's going to stay in the same rack. I can do this any day of the week ar... [14:42:05] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839413 (10Jclark-ctr) [14:42:06] 10ops-codfw, 06SRE, 06DC-Ops: sretest2001 has broken psu - https://phabricator.wikimedia.org/T423179#11839414 (10Jhancock.wm) 05Open→03Declined [14:42:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1151.eqiad.wmnet with OS trixie [14:43:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1151: after reimage to trixie [14:43:07] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db1151: after reimage to trixie [14:44:26] (03PS1) 10Harroyo-wmf: hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) [14:45:09] (03PS1) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (https://phabricator.wikimedia.org/T418261) [14:45:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage [14:45:53] !log cwhite@deploy1003 Started deploy [performance/arc-lamp@bd7b2ab]: T413127 [14:45:57] T413127: Directory Listing and Download from Object Storage - https://phabricator.wikimedia.org/T413127 
[14:46:02] !log cwhite@deploy1003 Finished deploy [performance/arc-lamp@bd7b2ab]: T413127 (duration: 00m 08s) [14:47:49] (03CR) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven) [14:47:54] (03PS2) 10Effie Mouzeli: site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (https://phabricator.wikimedia.org/T418261) [14:50:22] 10SRE-Access-Requests, 06Data-Engineering: Kindly requesting Kerberos password reset - https://phabricator.wikimedia.org/T423875#11839477 (10ssingh) [14:51:24] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2188'] [14:51:51] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11839483 (10Jdforrester-WMF) 05Open→03In progress a:03Jdforrester-WMF [14:52:02] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:57:20] (03PS1) 10JMeybohm: Decom various wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T42386) [14:58:11] 10SRE-Access-Requests, 06Data-Engineering: Kindly requesting Kerberos password reset - https://phabricator.wikimedia.org/T423875#11839516 (10ssingh) ` sukhe@krb1002:~$ sudo manage_principals.py reset-password mfischerwmf --email_address=mfischer@wikimedia.org Password reset successfully. Successfully sent emai... 
[14:58:27] 10SRE-Access-Requests, 06Data-Engineering: Kindly requesting Kerberos password reset - https://phabricator.wikimedia.org/T423875#11839517 (10ssingh) 05Open→03Resolved [14:58:29] jouncebot: now [14:58:29] For the next 0 hour(s) and 1 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1430) [14:58:35] jouncebot: next [14:58:35] In 0 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1530) [15:03:46] !log trueg@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:03:48] !log trueg@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:04:38] (03PS1) 10Bking: cloudelastic1012: Set LVS config for opensearch_2 [puppet] - 10https://gerrit.wikimedia.org/r/1275435 (https://phabricator.wikimedia.org/T422860) [15:04:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275435 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [15:05:10] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1010.eqiad.wmnet with reason: host reimage [15:05:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2006.codfw.wmnet with OS bullseye [15:05:47] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2006.codfw.wmnet with OS bullseye completed... 
[15:08:29] (03CR) 10Marostegui: [C:03+2] Revert "ms2: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275423 (owner: 10Marostegui) [15:08:46] (03CR) 10Bking: [C:03+2] cloudelastic1012: Set LVS config for opensearch_2 [puppet] - 10https://gerrit.wikimedia.org/r/1275435 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [15:09:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1010.eqiad.wmnet with reason: host reimage [15:11:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1151: repool after maintenance [15:11:04] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db1151: repool after maintenance [15:11:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2188'] [15:11:40] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 68424256 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:11:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [15:11:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [15:12:49] (03PS1) 10Marostegui: es2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275437 (https://phabricator.wikimedia.org/T423195) [15:13:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2036: Moving to another rack [15:13:24] 
(03CR) 10Marostegui: [C:03+2] es2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275437 (https://phabricator.wikimedia.org/T423195) (owner: 10Marostegui) [15:13:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2036: Moving to another rack [15:13:40] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2567752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:13:45] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11839636 (10Jhancock.wm) @Clement_Goubert did a firmware and bios update. error has cleared. should be good to repool. [15:14:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Moved to another rack [15:14:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11839638 (10phaultfinder) [15:16:36] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: move es2036 - https://phabricator.wikimedia.org/T423195#11839658 (10Marostegui) Host off, ready to be moved. [15:17:32] (03CR) 10Arlolra: [C:03+1] Increase Parsoid Read Views percentage for ruwiki to 55% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274387 (owner: 10C.
Scott Ananian) [15:20:11] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1166: Security update [15:21:09] (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275439 (https://phabricator.wikimedia.org/T414376) [15:21:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [15:21:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [15:21:56] I’ll deploy that ^ wikidata-query-gui bump soon if no one objects [15:23:07] (03PS1) 10Ottomata: html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216) [15:23:33] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [15:23:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91217 and previous config saved to /var/cache/conftool/dbconfig/20260420-152341-fceratto.json [15:23:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:25:23] (03CR) 10Ottomata: [C:03+2] html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [15:25:46] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.pool 
(exit_code=99) pool db1166: Security update [15:25:49] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1012.eqiad.wmnet [15:27:02] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839734 (10MatthewVernon) Thanos-be2006 now looks like: ` Filesystem Size Used Avail Use% Mounted on /dev/md0 110G 5.7G 99G 6% / /dev/sdy4... [15:27:03] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:27:50] (03Merged) 10jenkins-bot: html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [15:28:05] (03PS1) 10JMeybohm: Decom various wikikube-workers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275442 (https://phabricator.wikimedia.org/T423863) [15:28:05] (03CR) 10JavierMonton: [C:03+1] html-enrich - try mw-api-int to get earlier envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275441 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [15:28:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:28:53] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11839764 (10MatthewVernon) 05Open→03In progress [15:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1530). 
[15:30:09] jclark@cumin1003 reimage (PID 3304108) is awaiting input [15:33:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:33:33] (03PS2) 10Aude: Do not show donate button on affiliate wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) [15:34:49] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:34:53] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:35:48] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1166: Security update [15:35:56] (03CR) 10Elukey: [C:03+2] profile::pki::root_ca: create a new discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275416 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:36:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:36:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1010.eqiad.wmnet with OS bookworm [15:36:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host es2036 [15:36:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839832 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dbstore1010.eqiad.wmnet with OS bookworm completed: - dbstore1010 (**PASS**) - Removed from P... 
[15:36:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2036 [15:36:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudelastic1012.eqiad.wmnet [15:37:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839833 (10Jclark-ctr) [15:37:16] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11839837 (10Jclark-ctr) 05Open→03Resolved [15:37:23] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [15:37:23] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [15:38:09] (03PS1) 10Bking: Cirrussearch: remove unused hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1275444 (https://phabricator.wikimedia.org/T388607) [15:38:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275444 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [15:39:37] (03PS1) 10Ottomata: html-enrich - use mw-api-int for stream config too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275445 (https://phabricator.wikimedia.org/T421216) [15:40:05] (03CR) 10Ottomata: [V:03+2 C:03+2] html-enrich - use mw-api-int for stream config too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275445 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [15:41:15] (03CR) 10Bking: [C:03+2] Cirrussearch: remove unused hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1275444 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [15:41:26] (03CR) 10Effie Mouzeli: Decom various wikikube-workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T42386) (owner: 10JMeybohm) [15:41:32] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195#11839873 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm host was moved, netbox and dns updated. mgmt and network ping. ready to go back in. @Marostegui thank you for helping us with this! 
[15:41:33] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:41:37] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:42:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11839880 (10Jclark-ctr) @elukey did we have a work around for the usernames? [15:42:42] (03PS1) 10Marostegui: Revert "es2036: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275447 [15:43:37] (03CR) 10Effie Mouzeli: [C:03+1] Decom various wikikube-workers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275442 (https://phabricator.wikimedia.org/T423863) (owner: 10JMeybohm) [15:45:03] (03CR) 10Marostegui: [C:03+2] Revert "es2036: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275447 (owner: 10Marostegui) [15:45:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11839891 (10elukey) @Jclark-ctr not yet, we haven't got a definitive reply from supermicro yet. I have some code patches lined up that should unblock... [15:46:27] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [15:48:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5006.eqsin.wmnet with OS bookworm [15:48:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11839919 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5006.eqsin.wmnet with OS bookworm [15:49:38] (03CR) 10Ssingh: [C:03+2] varnish: trace all file uploads [puppet] - 10https://gerrit.wikimedia.org/r/1272869 (owner: 10CDanis) [15:50:10] !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2188.codfw.wmnet [15:50:10] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker2188.codfw.wmnet [15:50:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2036: Moving to another rack [15:50:49] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es2036: Moving to another rack [15:50:56] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2188.codfw.wmnet [15:50:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2188.codfw.wmnet [15:51:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2036: Moving to another rack [15:51:06] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11839924 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool for host wikikube-worker2188.codfw.wmnet completed: - wikikube-worker2188... 
[15:51:18] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11839937 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert tyvm :) [15:51:22] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1273925 (owner: 10CDobbins) [15:52:15] (03CR) 10JMeybohm: [C:04-1] "(I do think it's confusing to have these two things in one change)" [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [15:53:39] (03PS2) 10JMeybohm: Decom various wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T423863) [15:53:41] (03PS2) 10JMeybohm: Decom various wikikube-workers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275442 (https://phabricator.wikimedia.org/T423863) [15:55:02] !log sudo cumin -b31 "A:cp and not P{cp2041* or cp2042*}" "run-puppet-agent --enable 'merging CR 1272869'" [15:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:53] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840036 (10LSobanski) [15:57:03] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11840038 (10MoritzMuehlenhoff) [15:57:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude) [15:57:55] !log installing libvirt security updates [15:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:08] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable and configure WikiProjects prototype on WikiData 
beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven) [16:00:10] (03CR) 10JHathaway: firewall::service: Add a new parameter public_access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [16:00:28] (03PS5) 10Jasmine: role::aux_k8s::worker: add sophroid to lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) [16:02:48] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11840075 (10Aklapper) [16:03:19] (03PS1) 10Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449 [16:05:14] Quick question - there is a change I'd like to backport today to wmf.24, and for that I need to push 4 commits. What is the best way to do it? Do I cherry-pick and push 4 separate changes, or can I squash them into a single commit? 
[16:06:28] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cloudelastic1012.eqiad.wmnet [16:08:07] (03CR) 10Jdlrobson: [C:03+1] Do not show donate button on affiliate wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude) [16:08:28] (03CR) 10Muehlenhoff: firewall::service: Add a new parameter public_access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [16:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:49] (03PS1) 10Ottomata: html-enrich - update values with latest settings from T421216 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275453 (https://phabricator.wikimedia.org/T421216) [16:11:16] (03CR) 10Ottomata: [C:03+2] html-enrich - update values with latest settings from T421216 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275453 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [16:13:12] (03Merged) 10jenkins-bot: html-enrich - update values with latest settings from T421216 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275453 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [16:14:56] (03CR) 10Effie Mouzeli: [C:03+1] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449 (owner: 10Scott French) [16:16:24] (03PS1) 10Ottomata: html-enrich - set tolerable-failed-checkpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275458 (https://phabricator.wikimedia.org/T421216) [16:16:57] (03CR) 10Ottomata: [V:03+2 C:03+2] html-enrich - set tolerable-failed-checkpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275458 
(https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [16:17:22] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840160 (10MoritzMuehlenhoff) >>! In T422596#11833600, @jcrespo wrote: > In any case, backupmon1001.eqiad.wmnet is a very very tiny instance (an apache with just 1 user- me). No pr... [16:17:30] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:17:34] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:17:51] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [16:19:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM backupmon1001.eqiad.wmnet [16:19:51] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840176 (10ops-monitoring-bot) VM backupmon1001.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None [16:21:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1166: Security update [16:22:55] (03CR) 10Cathal Mooney: [C:03+1] "LGTM. 
The only thing that does spring to mind is the name, not sure if we might have some services on non-public vlans that SREs might wa" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [16:23:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91226 and previous config saved to /var/cache/conftool/dbconfig/20260420-162359-fceratto.json [16:24:04] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:24:38] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1275460 [16:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840215 (10phaultfinder) [16:25:50] (03CR) 10Muehlenhoff: "Good point, maybe something along the lines of "unrestricted" instead of "public access" works better?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [16:25:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [16:25:57] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1275460 (owner: 10Marostegui) [16:26:00] !log marostegui@dns1004 START - running authdns-update [16:26:50] !log Switchover m3 proxy (phabricator) [16:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:02] (03CR) 10JHathaway: [C:03+1] ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:27:27] !log marostegui@dns1004 END - running authdns-update [16:28:28] (03PS1) 10Muehlenhoff: Add ganeti5006 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275461 (https://phabricator.wikimedia.org/T421863) [16:28:51] (03CR) 10JHathaway: [C:03+1] firewall::service: Add a new parameter public_access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [16:29:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM backupmon1001.eqiad.wmnet [16:29:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet [16:29:36] (03CR) 10Dzahn: gerrit: update sync-instances cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [16:29:49] (03PS2) 10Herron: kafka-logging: update kafka-logging2001 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1273863 (https://phabricator.wikimedia.org/T423723) [16:29:57] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti 
- https://phabricator.wikimedia.org/T422596#11840238 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None [16:32:40] (03PS2) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) [16:33:09] (03CR) 10Elukey: ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:33:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet [16:34:02] (03PS1) 10RLazarus: mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) [16:34:06] (03PS3) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) [16:34:07] (03PS1) 10RLazarus: mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) [16:34:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91227 and previous config saved to /var/cache/conftool/dbconfig/20260420-163407-fceratto.json [16:34:17] (03CR) 10CI reject: [V:04-1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:52] !log jmm@cumin2002 
START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet [16:35:17] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840327 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None [16:35:27] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11840329 (10herron) >>! In T423723#11837649, @elukey wrote: > @herron I would change a thing - I think it is sufficient to u... [16:35:50] (03CR) 10JHathaway: [C:03+1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:36:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2036: Moving to another rack [16:36:30] (03CR) 10Herron: [C:03+2] kafka-logging: update kafka-logging2001 confluent distro to 77 [puppet] - 10https://gerrit.wikimedia.org/r/1273863 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [16:36:36] (03CR) 10Dzahn: [C:03+2] ci::docker: only install docker-cli if on trixie or newer [puppet] - 10https://gerrit.wikimedia.org/r/1274067 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [16:37:05] (03PS1) 10RLazarus: mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275467 (https://phabricator.wikimedia.org/T423311) [16:37:21] (03PS4) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) [16:37:28] mutante: shall I go ahead and multiple? [16:38:02] herron: yes, multiple is fine. thanks! 
[16:38:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet [16:38:59] mutante: annd done! [16:39:23] ty [16:40:13] (03CR) 10CI reject: [V:04-1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:41:16] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11840357 (10MoritzMuehlenhoff) [16:41:53] (03CR) 10Elukey: "spicerack/hosts.py: note: In member "ipmi" of class "Host":" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:42:54] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.x - https://phabricator.wikimedia.org/T423723#11840364 (10herron) [16:44:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91229 and previous config saved to /var/cache/conftool/dbconfig/20260420-164415-fceratto.json [16:44:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet [16:45:05] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840376 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet rebooted by jmm@cumin2002 with reason: None [16:47:19] (03CR) 10JHathaway: ipmi: rework how to use a different user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:48:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet [16:48:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ganeti5006.eqsin.wmnet with OS bookworm [16:48:43] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11840395 (10MoritzMuehlenhoff) [16:48:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11840396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5006.eqsin.wmnet with OS bookworm completed: - ganeti5... [16:52:36] !log installing imagemagick security updates [16:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:23] FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91230 and previous config saved to /var/cache/conftool/dbconfig/20260420-165423-fceratto.json [16:54:28] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:54:42] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:54:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:55:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P91231 and previous config saved to /var/cache/conftool/dbconfig/20260420-165459-fceratto.json [17:00:05] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1700) [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T1700). [17:00:06] (03PS1) 10Bking: cloudelastic1012: move back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1275473 (https://phabricator.wikimedia.org/T422860) [17:00:53] (03CR) 10Bking: [C:03+2] cloudelastic1012: move back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1275473 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:02:50] I'll likely be deploying some non-mediawiki changes during the infra window (need a couple of minutes to double check some unrelated diffs) [17:02:59] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie [17:05:55] (03PS5) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) [17:08:32] 06SRE: wiki.openstreetmap.org Commons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570#11840468 (10jcrespo) Let me ask, while all data I have access to is already anonymous, it is still user's private data, just osm wiki is the referrer. Let me ask what parts (if any) I can disclose for peopl... 
[17:09:06] (03CR) 10Scott French: [C:03+2] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449 (owner: 10Scott French) [17:10:26] (03PS6) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) [17:11:11] (03CR) 10Elukey: "@" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [17:11:29] (03Merged) 10jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275449 (owner: 10Scott French) [17:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840527 (10phaultfinder) [17:14:42] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [17:15:04] (03PS1) 10Pmiazga: Attribution: Clean up API spec descriptions [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) [17:15:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1005.eqiad.wmnet with OS bullseye [17:15:07] (03PS1) 10Pmiazga: i18n: Use {{doc-markdown}} template in Attribution qqq.json [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) [17:15:09] (03PS1) 10Pmiazga: Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 [17:15:09] (03PS1) 10Pmiazga: Attribution: Add localized texts for trending param [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) [17:15:13] 06SRE, 10SRE-swift-storage, 06SRE 
Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11840529 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1005.eqiad.wmnet with OS bullseye [17:16:54] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [17:17:20] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:17:21] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:17:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [17:17:33] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:17:34] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:17:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [17:17:47] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:17:48] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:17:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga) 
[17:18:02] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:18:03] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:18:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga) [17:18:20] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:18:21] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:18:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [17:18:43] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:21:44] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:22:05] 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#11840551 (10Ladsgroup) >>! In T352744#9413282, @MoritzMuehlenhoff wrote: >>>! In T352744#9413140, @jhathaway wrote: >> wolfssl is packaged in Debian, so that may be a possible option longer term, https://... 
[17:22:31] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:23:02] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:23:32] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:24:04] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:24:20] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:24:51] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:25:07] (03PS1) 10Elukey: profile::pki::intermediates: add discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275479 (https://phabricator.wikimedia.org/T420993) [17:25:10] (03PS1) 10Elukey: role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) [17:25:14] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:25:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:40] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11840566 (10MatthewVernon) [17:25:45] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:25:51] (03CR) 10Elukey: "This patch also needs the corresponding secret for the private key." 
[puppet] - 10https://gerrit.wikimedia.org/r/1275479 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [17:26:15] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:26:22] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [17:26:47] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:27:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage [17:27:43] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:28:30] (03PS1) 10Elukey: Add fake private secrets for discovery2026 PKI intermediate [labs/private] - 10https://gerrit.wikimedia.org/r/1275481 (https://phabricator.wikimedia.org/T420993) [17:28:56] 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#11840575 (10ssingh) >>! In T352744#11840551, @Ladsgroup wrote: >>>! In T352744#9413282, @MoritzMuehlenhoff wrote: >>>>! In T352744#9413140, @jhathaway wrote: >>> wolfssl is packaged in Debian, so that may... 
[17:34:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage [17:35:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS trixie [17:35:54] (03CR) 10Alex Paskulin: [C:03+1] Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga) [17:36:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:37:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:38:06] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:41:49] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:42:36] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:42:37] elukey@cumin1003 provision (PID 3425810) is awaiting input [17:43:08] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:43:45] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:44:17] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:44:30] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:44:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840614 (10phaultfinder) [17:45:01] !log swfrench@deploy1003 helmfile [eqiad] START 
helmfile.d/services/shellbox-syntaxhighlight: apply [17:45:17] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:45:26] (03PS1) 10Bking: cloudelastic1012: move back to production role [puppet] - 10https://gerrit.wikimedia.org/r/1275485 (https://phabricator.wikimedia.org/T422860) [17:45:48] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:46:10] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:46:23] (03CR) 10Bking: [C:03+2] cloudelastic1012: move back to production role [puppet] - 10https://gerrit.wikimedia.org/r/1275485 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:46:41] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:47:54] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:47:55] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:48:14] ^ known - maintenance in progress [17:48:55] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:51:50] new jenkins hosts will take over soon but haven't just yet. WIP [17:52:05] (03CR) 10RLazarus: "James: Please review for "yep, we aren't expecting mw-mcrouter to have its own mcrouter on 127.0.0.1:11213 anymore."" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [17:52:54] (03CR) 10RLazarus: "James: Please review for whether this matches your expectations of what routes exist where (and the revised comment is up-to-date)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [17:53:15] (03CR) 10Herron: thanos/compact: avoid constant Puppet changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [17:54:16] (03CR) 10Herron: [C:03+2] pyrra: remove configuration for web interface [puppet] - 10https://gerrit.wikimedia.org/r/1270992 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [17:56:12] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11840735 (10elukey) I was able to repro: ` 2026-04-20 17:39:37,345 elukey 3425810 [DEBUG wmflib.interactive:229 in confirm_on_failure] Traceback Traceback (most recent call la... [17:56:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [17:56:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1005.eqiad.wmnet with OS bullseye [17:56:58] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11840743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1005.eqiad.wmnet with OS bullseye completed... 
[17:59:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11840747 (10phaultfinder) [18:05:15] (03CR) 10Herron: [C:03+2] pyrra: remove pyrra/slo/slos dns entries [dns] - 10https://gerrit.wikimedia.org/r/1270995 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [18:05:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1272832 (owner: 10JHathaway) [18:05:37] !log herron@dns1004 START - running authdns-update [18:06:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [18:07:08] !log herron@dns1004 END - running authdns-update [18:09:02] 10SRE-SLO, 13Patch-For-Review: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11840814 (10herron) [18:09:17] (03PS2) 10Jforrester: Attribution: Clean up API spec descriptions [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [18:09:17] (03PS2) 10Jforrester: i18n: Use {{doc-markdown}} template in Attribution qqq.json [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [18:09:17] (03PS2) 10Jforrester: Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga) [18:09:17] (03PS2) 10Jforrester: Attribution: Add localized texts for trending param [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga) [18:09:19] 
(03PS1) 10Jforrester: Attribution: Update contact and add call to action [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275488 (https://phabricator.wikimedia.org/T422502) [18:09:42] (03CR) 10Herron: [C:03+2] "yes!" [puppet] - 10https://gerrit.wikimedia.org/r/1270974 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [18:11:30] !log drop of langlinks table on testcommonswiki (T421914) [18:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:34] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [18:12:58] (03CR) 10JHathaway: [C:03+2] ensure net.netfilter.nf_conntrack_max is updated [puppet] - 10https://gerrit.wikimedia.org/r/1272832 (owner: 10JHathaway) [18:15:49] (03CR) 10JHathaway: [C:03+1] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [18:15:52] (03PS1) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:16:15] (03CR) 10Ayounsi: [C:03+1] Add ganeti5006 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275461 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [18:17:07] (03CR) 10CI reject: [V:04-1] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:17:21] (03CR) 10JHathaway: [C:03+1] profile::pki::intermediates: add discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275479 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [18:19:04] 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11840906 (10Ladsgroup) We need to find a 
different name for the mailing list. We are trying to standardize the mailing list names. See https://meta.wikimedia.org/wiki/Mailing_l... [18:19:06] (03CR) 10JHathaway: role::pki::multiroot: configure discovery2026 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [18:19:08] (03PS2) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:19:15] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fbda72d1550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [18:19:15] dia.org/wiki/Search%23Administration [18:19:21] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:19:36] Heads-up: I'm going to backport an i18n change for MW-Interfaces, rather than have it swamp the normal window. 
[18:19:41] (03CR) 10JHathaway: [C:03+1] Add fake private secrets for discovery2026 PKI intermediate [labs/private] - 10https://gerrit.wikimedia.org/r/1275481 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [18:20:18] (03PS3) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:20:21] (03CR) 10Herron: [C:03+2] puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [18:20:26] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:21:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [18:21:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [18:21:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga) [18:21:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275488 (https://phabricator.wikimedia.org/T422502) (owner: 10Jforrester) [18:21:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 
(https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga) [18:21:51] (03PS4) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:23:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:24:01] (03PS3) 10Herron: puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307) [18:24:15] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f37550f9550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [18:24:15] dia.org/wiki/Search%23Administration [18:24:37] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 353.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:24:59] (03Merged) 10jenkins-bot: Attribution: Clean up API spec descriptions [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275475 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [18:25:01] (03Merged) 10jenkins-bot: i18n: Use {{doc-markdown}} template in Attribution qqq.json [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275476 (https://phabricator.wikimedia.org/T422502) (owner: 10Pmiazga) [18:25:03] (03Merged) 10jenkins-bot: Attribution: Documentation copyedits [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275477 (owner: 10Pmiazga) 
[18:25:04] (03Merged) 10jenkins-bot: Attribution: Update contact and add call to action [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275488 (https://phabricator.wikimedia.org/T422502) (owner: 10Jforrester) [18:25:06] (03Merged) 10jenkins-bot: Attribution: Add localized texts for trending param [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275478 (https://phabricator.wikimedia.org/T423541) (owner: 10Pmiazga) [18:25:30] (03CR) 10Herron: [C:03+2] puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [18:25:31] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1275475|Attribution: Clean up API spec descriptions (T422502)]], [[gerrit:1275476|i18n: Use {{doc-markdown}} template in Attribution qqq.json (T422502)]], [[gerrit:1275477|Attribution: Documentation copyedits]], [[gerrit:1275488|Attribution: Update contact and add call to action (T422502)]], [[gerrit:1275478|Attribution: Add localized texts for tren [18:25:31] ding param (T423541)]] [18:25:37] T422502: Clean up Attribution API spec descriptions - https://phabricator.wikimedia.org/T422502 [18:25:37] T423541: 'trending' signal in the Attribution API is not returning the correct descriptions in the schema - https://phabricator.wikimedia.org/T423541 [18:28:00] 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11840986 (10Luisfff2812) Hi @Ladsgroup thank you for your guidance! We plan to apply for User Group recognition starting next year, this year we are focused on strengthening ou... 
[18:29:23] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [18:29:35] (03PS5) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:29:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:29:51] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841002 (10phaultfinder) [18:34:16] 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11841011 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done! https://lists.wikimedia.org/postorius/lists/wikimedia-jujuy.lists.wikimedia.org [18:36:06] (03PS6) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:36:19] 06SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikimedistas de Jujuy mailing list - https://phabricator.wikimedia.org/T423671#11841016 (10Luisfff2812) Thank you so much, @Ladsgroup!!! [18:37:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:41:07] (03CR) 10Jforrester: "@zabe I cherry-picked this speculatively; do you think we should deploy it?" 
[core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1273787 (https://phabricator.wikimedia.org/T423654) (owner: 10Jforrester) [18:42:20] !log jforrester@deploy1003 pmiazga, jforrester: Backport for [[gerrit:1275475|Attribution: Clean up API spec descriptions (T422502)]], [[gerrit:1275476|i18n: Use {{doc-markdown}} template in Attribution qqq.json (T422502)]], [[gerrit:1275477|Attribution: Documentation copyedits]], [[gerrit:1275488|Attribution: Update contact and add call to action (T422502)]], [[gerrit:1275478|Attribution: Add localized texts for trending [18:42:20] param (T423541)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:42:31] T422502: Clean up Attribution API spec descriptions - https://phabricator.wikimedia.org/T422502 [18:42:31] T423541: 'trending' signal in the Attribution API is not returning the correct descriptions in the schema - https://phabricator.wikimedia.org/T423541 [18:42:45] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11841044 (10Jdforrester-WMF) 05In progress→03Resolved OK, this should now be Resolved. Hopefully. 
[18:43:25] (03PS7) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:43:36] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:44:16] !log jforrester@deploy1003 pmiazga, jforrester: Continuing with sync [18:47:00] 10SRE-Access-Requests: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943 (10jasmine_) 03NEW [18:49:21] (03PS8) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:50:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:52:46] (03PS3) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275430 [18:53:23] James_F: I'd like to deploy a scap update when you're done. 
[18:54:17] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:37] (03CR) 10Jforrester: [C:03+1] mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [18:55:14] 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841090 (10herron) [18:55:23] (03PS9) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [18:55:54] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275475|Attribution: Clean up API spec descriptions (T422502)]], [[gerrit:1275476|i18n: Use {{doc-markdown}} template in Attribution qqq.json (T422502)]], [[gerrit:1275477|Attribution: Documentation copyedits]], [[gerrit:1275488|Attribution: Update contact and add call to action (T422502)]], [[gerrit:1275478|Attribution: Add localized texts for tre [18:55:54] nding param (T423541)]] (duration: 30m 23s) [18:55:56] dancy: Absolutely; should be done now. [18:55:58] T422502: Clean up Attribution API spec descriptions - https://phabricator.wikimedia.org/T422502 [18:55:58] T423541: 'trending' signal in the Attribution API is not returning the correct descriptions in the schema - https://phabricator.wikimedia.org/T423541 [18:56:08] Thanks! 
[18:56:24] !log dancy@deploy1003 Installing scap version "4.249.0" for 2 host(s) [18:56:28] (03PS4) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275430 [18:57:14] (03Abandoned) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275430 (owner: 10Effie Mouzeli) [18:58:18] !log dancy@deploy1003 Installation of scap version "4.249.0" completed for 2 hosts [18:58:47] I'm done. [18:59:17] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:59:21] (03PS2) 10Elukey: role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) [18:59:34] (03CR) 10Elukey: role::pki::multiroot: configure discovery2026 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [19:00:00] (03PS10) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:00:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:00:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2190: Security update [19:00:45] (03PS1) 10Effie Mouzeli: site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275497 [19:02:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. 
There is still one ferm service to port to firewall::service before the rdb* hosts are ready (redis_master_role), but easy eno" [puppet] - 10https://gerrit.wikimedia.org/r/1275497 (owner: 10Effie Mouzeli) [19:03:46] (03PS2) 10Scott French: P:mediawiki::php: Support component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964) [19:04:14] (03PS2) 10Scott French: hieradata: Switch deployment hosts to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275492 (https://phabricator.wikimedia.org/T422964) [19:04:16] (03PS2) 10Scott French: hieradata: Switch parsoidtest1001 to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275493 (https://phabricator.wikimedia.org/T422964) [19:04:21] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:05:10] (03CR) 10JHathaway: [C:03+1] role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [19:05:29] (03CR) 10JHathaway: [C:03+1] role::pki::multiroot: configure discovery2026 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [19:07:15] (03PS11) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:07:51] (03PS1) 10Jasmine: admin: add spare FIDO backed key [Jasmine] [puppet] - 10https://gerrit.wikimedia.org/r/1275501 (https://phabricator.wikimedia.org/T423943) [19:08:21] (03PS1) 10Effie Mouzeli: (DNM) site.pp: add role for rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1275502 (https://phabricator.wikimedia.org/T418261) [19:11:26] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:12:22] (03CR) 10Effie Mouzeli: [C:03+1] "Verified OOB" [puppet] - 10https://gerrit.wikimedia.org/r/1275501 (https://phabricator.wikimedia.org/T423943) (owner: 10Jasmine) [19:12:41] (03CR) 10Jforrester: "See my existing patch, though we can use this one instead if you prefer. But let's follow MSB naming here (so evaluator-rust-javascript no" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [19:14:25] 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841207 (10herron) [19:16:14] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: Switch deployment hosts to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275492 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [19:16:59] (03CR) 10Scott French: "Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [19:17:10] (03CR) 10Effie Mouzeli: [C:03+1] P:mediawiki::php: Support component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [19:19:38] (03PS12) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:19:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:21:54] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: Switch parsoidtest1001 to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275493 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [19:22:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - 
https://phabricator.wikimedia.org/T421421#11841270 (10wiki_willy) @Jclark-ctr & @VRiley-WMF - can you provide a status on this one? [19:25:27] 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11841294 (10elukey) @ayounsi I think that iDRAC 10 hosts don't support the new LLDP code :( T418899#11840735 [19:27:30] (03PS1) 10Elukey: sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) [19:28:46] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:29:01] 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841312 (10herron) [19:29:04] 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11841313 (10herron) 05Open→03Resolved a:03herron [19:29:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11841318 (10ssingh) [19:30:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11841319 (10ssingh) Clarified the scope of work, "Set up lvs1017 with new NIC" is DC Ops and then Traffic is responsible for the other bits in the task ("Promote lvs1017"). 
[19:33:56] (03PS13) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:33:56] (03PS1) 10Andrew Bogott: cloudinfra hiera: remove obsolete hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1275511 [19:34:07] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:34:34] (03CR) 10CI reject: [V:04-1] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:37:55] (03CR) 10Scott French: [C:03+1] mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [19:37:57] (03PS14) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:38:02] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:38:16] (03CR) 10Elukey: "Not pretty I know, but I haven't found a good solution yet :(" [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) (owner: 10Elukey) [19:43:38] (03PS15) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:43:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:44:19] (03CR) 10RLazarus: [C:03+2] "and re-verified just for good measure!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1275501 (https://phabricator.wikimedia.org/T423943) (owner: 10Jasmine) [19:44:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841349 (10phaultfinder) [19:45:04] (03CR) 10Scott French: [C:03+1] mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [19:46:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2190: Security update [19:47:53] elukey@cumin1003 provision (PID 3504517) is awaiting input [19:48:16] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11841356 (10elukey) The other error seems to be: ` Created attribute BIOS.Setup.1-1 -> UncoreFrequency (with Set On Import True) with value DynamicUFS ` [19:48:19] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:53:15] (03PS16) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:53:41] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:53:47] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:53:51] (03CR) 10CI reject: [V:04-1] 
designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:57:30] (03PS17) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [19:57:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [19:58:05] (03CR) 10CI reject: [V:04-1] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T2000). [20:00:05] aude: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:23] hi [20:00:53] looks like mine is the only patch so i can handle it [20:02:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude) [20:04:47] (03Merged) 10jenkins-bot: Do not show donate button on affiliate wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275428 (https://phabricator.wikimedia.org/T423876) (owner: 10Aude) [20:05:05] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1275428|Do not show donate button on affiliate wikis (T423876)]] [20:05:10] T423876: Remove donate button on Vector 2022 from affiliate wikis - https://phabricator.wikimedia.org/T423876 [20:06:30] (03PS18) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [20:08:37] !log aude@deploy1003 aude: Backport for [[gerrit:1275428|Do not show donate button on affiliate wikis (T423876)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:10:15] !log aude@deploy1003 aude: Continuing with sync [20:13:40] 06SRE, 10SRE-Access-Requests: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423942#11841439 (10Aklapper) →14Duplicate dup:03T423943 [20:13:42] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943#11841441 (10Aklapper) [20:13:55] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943#11841444 (10Aklapper) a:03jasmine_ [20:16:02] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275428|Do not show donate button on affiliate wikis (T423876)]] (duration: 10m 57s) [20:16:03] (03CR) 10Andrew Bogott: "only 18 tries to get puppet reduce working :(" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [20:16:06] T423876: Remove donate button on Vector 2022 from affiliate wikis - https://phabricator.wikimedia.org/T423876 [20:19:15] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1632, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_ [20:19:15] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841520 (10phaultfinder) [20:41:34] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host 
- https://phabricator.wikimedia.org/T394357#11841521 (10Jhancock.wm) @elukey did we get anything back from SM on the ticket you opened for this one? [20:45:15] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1651, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassig [20:45:15] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:48:15] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1533, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_ [20:48:15] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:52:08] sorry, i'm late to the backport window. are backports still in progress? 
[20:53:23] FIRING: [8x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:53:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841538 (10phaultfinder) [20:55:16] (03PS2) 10Ecarg: Wikifunctions: add helm values for function-evaluator in Rust [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) [20:55:39] (03CR) 10Ecarg: "sry, do you have a link to that patch? I'm having trouble finding it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor I ♥ Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T2100). [21:02:15] Hey all - we have a couple of security patches to get out today... 
[21:03:12] (03CR) 10Bking: [C:03+2] opensearch on k8s: Activate semantic-search and ipoid in services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1272909 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [21:22:33] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 7846.32 ms [21:23:49] (03CR) 10Andrew Bogott: [C:03+2] Trove guest-agent: update postgresql and mariadb backup versions [puppet] - 10https://gerrit.wikimedia.org/r/1261579 (https://phabricator.wikimedia.org/T420737) (owner: 10Andrew Bogott) [21:24:21] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841633 (10phaultfinder) [21:25:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:55] preparing to run scap [21:28:08] FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11841638 (10Jclark-ctr) @Jgreen I have not received any updates on mgmt usernames, but I have a feeling we will not be able to use “root” as the username on mgmt for Supermicro... 
[21:31:00] (03PS1) 10Ryan Kemper: prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) [21:33:08] Deployed security fix for T299359 [21:33:09] (03PS2) 10Ryan Kemper: prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) [21:33:11] !log Deployed security fix for T299359 [21:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:14] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:34:30] maryum: if you don't mind pinging me whenever you're finished, I've got some stuff to go out, but no rush :) [21:34:42] yes about to run scap once more and then I'm done [21:34:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11841648 (10phaultfinder) [21:34:48] rad [21:35:25] (03PS3) 10Ryan Kemper: prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) [21:38:17] (03CR) 10Catrope: [C:03+2] Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [21:39:08] (03Merged) 10jenkins-bot: Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [21:40:01] (03CR) 10Bking: [C:03+1] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan 
Kemper) [21:41:09] (03CR) 10Bking: [C:03+1] "Confirmed working on cloudelastic1011 (bullseye/Python 3.9) and cloudelastic1012 (trixie/Python 3.12)" [puppet] - 10https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:42:41] rzl: finished with scap [21:42:53] !log Deployed security fix for T406954 [21:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:26] (03CR) 10Ryan Kemper: [C:03+2] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1275535 (https://phabricator.wikimedia.org/T422860) (owner: 10Ryan Kemper) [21:44:38] maryum: thanks! [21:44:55] (03CR) 10RLazarus: [C:03+2] mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [21:45:13] (03CR) 10RLazarus: [C:03+2] mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [21:45:31] (03CR) 10RLazarus: mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [21:46:59] "Warning: Undefined array key "default" in /srv/mediawiki-staging/wmf-config/CommonSettings-labs.php on line 576" [21:47:16] (03Merged) 10jenkins-bot: mw-wikifunctions: Remove in-pod mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275463 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [21:49:43] maryum: sbassett ^ I think the config patch has upset beta [21:49:55] I didn't deploy anything to config [21:50:13] reedy: only one to core and one to abuse filter [21:50:25] Sure, but the patch has been merged, it will be deployed automatically to beta [21:51:07] reedy: wonder if I 
should revert both patches [21:51:07] Yeah the 'default' access here is wrong: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1272895/5/wmf-config/CommonSettings-labs.php [21:51:10] (03CR) 10Reedy: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [21:51:19] I'm holding off, it's all yours if you need it [21:51:43] Let me fix [21:51:50] reedy: oky [21:52:20] since I've merged my deployment-charts change, the helmfile diffs will come along when you run scap -- that's fine by me, I'll be here to monitor, but I can revert if you'd prefer to do one thing at a time [21:52:46] (or I can push mine out quickly and be out of your way) [21:53:03] (03PS1) 10Reedy: CommonSettings-labs: Fix up CSP config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612) [21:53:53] rzl: Don't need to hold off for me [21:54:10] (03CR) 10Reedy: [C:03+2] CommonSettings-labs: Fix up CSP config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612) (owner: 10Reedy) [21:54:33] Reedy: cool, starting a helmfile-only scap then [21:55:03] (03PS1) 10Dzahn: jenkins: add firewall rule for new jenkins to gearman on legacy host [puppet] - 10https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) [21:56:04] (03CR) 10Reedy: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [21:56:05] (03Merged) 10jenkins-bot: CommonSettings-labs: Fix up CSP config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612) (owner: 10Reedy) [21:57:03] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1275463 
T423311 T423624 [21:57:08] T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?) - https://phabricator.wikimedia.org/T423311 [21:57:09] T423624: Drop in-pod mcrouter from mw-wikifunctions pod, no longer used - https://phabricator.wikimedia.org/T423624 [21:57:26] 06SRE, 10SRE-Access-Requests: Add spare FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T423943#11841701 (10jasmine_) 05Open→03Resolved [21:58:53] (03CR) 10SBassett: "Ugh, thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275536 (https://phabricator.wikimedia.org/T419612) (owner: 10Reedy) [21:58:55] FIRING: SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:59:11] !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1275463 T423311 T423624 (duration: 03m 24s) [21:59:20] (03CR) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [22:17:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274387 (owner: 10C. Scott Ananian) [22:21:51] (03PS1) 10C. 
Scott Ananian: Revert "Skin: Avoid stretching low resolution images" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524) [22:22:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524) (owner: 10C. Scott Ananian) [22:23:52] (03CR) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [22:25:17] (03CR) 10Scott French: [C:03+1] mwscript-k8s: add --output-file flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [22:25:26] (03PS6) 10Jdlrobson: Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [22:25:45] (03PS2) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a28 [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275541 (https://phabricator.wikimedia.org/T420102) [22:26:18] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a28 [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275542 (https://phabricator.wikimedia.org/T423662) [22:26:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275542 (https://phabricator.wikimedia.org/T423662) (owner: 10C. 
Scott Ananian) [22:26:58] (03CR) 10Reedy: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [22:29:40] (03CR) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [22:37:02] (03PS1) 10Jdlrobson: Don't set href for a link that has been unset [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275543 (https://phabricator.wikimedia.org/T422907) [22:52:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2022.codfw.wmnet with OS trixie [22:52:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11841905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2022.codfw.wmnet with OS trixie [22:52:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2023.codfw.wmnet with OS trixie [22:53:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2024.codfw.wmnet with OS trixie [22:53:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11841906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2023.codfw.wmnet with OS trixie [22:53:15] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11841907 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2024.codfw.wmnet with OS trixie [22:58:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - 
https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [22:59:04] (03CR) 10RLazarus: "Good idea! Comments on the implementation but no objections to doing it." [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [23:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260420T2300) [23:02:17] starting some deploys shortly [23:02:23] let me know if any reason not to [23:03:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524) (owner: 10C. Scott Ananian) [23:06:39] (03PS1) 10Jdlrobson: [Mobile Page Previews] Avoid syntax error on older browsers [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959) [23:06:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2022.codfw.wmnet with reason: host reimage [23:07:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2023.codfw.wmnet with reason: host reimage [23:07:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2024.codfw.wmnet with reason: host reimage [23:07:58] (03Merged) 10jenkins-bot: Revert "Skin: Avoid stretching low resolution images" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275540 (https://phabricator.wikimedia.org/T421524) (owner: 10C. 
Scott Ananian) [23:10:39] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1275540|Revert "Skin: Avoid stretching low resolution images" (T421524 T423676)]] [23:10:45] T421524: Small images are scaled up by thumbnail preference - https://phabricator.wikimedia.org/T421524 [23:10:45] T423676: Infobox images have huge padding in Firefox - https://phabricator.wikimedia.org/T423676 [23:12:06] (03CR) 10Eric Gardner: [C:03+1] [Mobile Page Previews] Avoid syntax error on older browsers [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959) (owner: 10Jdlrobson) [23:12:21] !log jdlrobson@deploy1003 cscott, jdlrobson: Backport for [[gerrit:1275540|Revert "Skin: Avoid stretching low resolution images" (T421524 T423676)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:12:47] !log jdlrobson@deploy1003 cscott, jdlrobson: Continuing with sync [23:14:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2022.codfw.wmnet with reason: host reimage [23:16:36] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275540|Revert "Skin: Avoid stretching low resolution images" (T421524 T423676)]] (duration: 05m 56s) [23:16:41] T421524: Small images are scaled up by thumbnail preference - https://phabricator.wikimedia.org/T421524 [23:16:41] T423676: Infobox images have huge padding in Firefox - https://phabricator.wikimedia.org/T423676 [23:17:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959) (owner: 10Jdlrobson) [23:19:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2023.codfw.wmnet with reason: host reimage [23:19:58] 
(03Merged) 10jenkins-bot: [Mobile Page Previews] Avoid syntax error on older browsers [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275547 (https://phabricator.wikimedia.org/T423959) (owner: 10Jdlrobson) [23:20:12] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1275547|[Mobile Page Previews] Avoid syntax error on older browsers (T423959)]] [23:20:24] T423959: Page Previews: Instrumentation code throws syntax errors in older browsers - https://phabricator.wikimedia.org/T423959 [23:21:48] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1275547|[Mobile Page Previews] Avoid syntax error on older browsers (T423959)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:24:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2024.codfw.wmnet with reason: host reimage [23:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842040 (10phaultfinder) [23:24:38] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [23:28:25] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275547|[Mobile Page Previews] Avoid syntax error on older browsers (T423959)]] (duration: 08m 13s) [23:28:29] T423959: Page Previews: Instrumentation code throws syntax errors in older browsers - https://phabricator.wikimedia.org/T423959 [23:29:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2022.codfw.wmnet with OS trixie [23:29:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11842044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2022.codfw.wmnet with OS trixie completed: - pc2022 (**WARN**) - Dow... 
[23:30:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [23:32:04] (03Merged) 10jenkins-bot: Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [23:32:21] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1271862|Restore PageImages functionality to Wikisources and Wikibooks (T417538)]] [23:32:25] T417538: Enable PageImages by default for Wikisource and Wikibooks - https://phabricator.wikimedia.org/T417538 [23:34:01] !log jdlrobson@deploy1003 jdlrobson, ignaciorodrguez: Backport for [[gerrit:1271862|Restore PageImages functionality to Wikisources and Wikibooks (T417538)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:34:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2023.codfw.wmnet with OS trixie [23:36:24] !log jdlrobson@deploy1003 jdlrobson, ignaciorodrguez: Continuing with sync [23:39:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2024.codfw.wmnet with OS trixie [23:39:20] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11842080 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2024.codfw.wmnet with OS trixie completed: - pc2024 (**WARN**) - Dow... 
[23:39:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1275554 [23:39:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1275554 (owner: 10TrainBranchBot) [23:40:08] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271862|Restore PageImages functionality to Wikisources and Wikibooks (T417538)]] (duration: 07m 47s) [23:40:17] T417538: Enable PageImages by default for Wikisource and Wikibooks - https://phabricator.wikimedia.org/T417538 [23:50:54] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1275554 (owner: 10TrainBranchBot) [23:53:15] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11842127 (10Jhancock.wm) 05Open→03Resolved @Marostegui fixed it. but please reopen the ticket if anything seems off. [23:54:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus