[00:07:28] PROBLEM - MariaDB disk space on db1208 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:07:28] PROBLEM - MariaDB Replica SQL: matomo on db1208 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:07:28] PROBLEM - mysqld processes on db1208 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:07:28] PROBLEM - MariaDB Replica IO: matomo on db1208 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:07:29] PROBLEM - MariaDB Replica SQL: analytics_meta on db1208 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:07:29] PROBLEM - MariaDB Replica IO: analytics_meta on db1208 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:07:50] PROBLEM - MariaDB read only matomo on db1208 is CRITICAL: Could not connect to localhost:3351 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[00:07:58] PROBLEM - MariaDB read only analytics_meta on db1208 is CRITICAL: Could not connect to localhost:3352 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[00:08:57] (CR) Santiago Faci: "Also, we should keep in mind that, according to what is said in the related ticket, we don't want to make this change while any experiment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1303490 (owner: Pushpaktiwari)
[00:11:28] PROBLEM - MariaDB Replica Lag: matomo on db1208 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:11:28] PROBLEM - MariaDB Replica Lag: analytics_meta on db1208 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:17:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[01:08:01] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12048730 (Papaul)
[01:10:15] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12048732 (Papaul)
[01:12:27] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1305281
[01:12:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1305281 (owner: TrainBranchBot)
[01:13:19] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12048738 (Papaul)
[01:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[01:20:22] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1305281 (owner: TrainBranchBot)
[01:52:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:41] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:26] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 45s)
[02:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:50] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:13:48] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:25:41] (PS1) Clare Ming: Remove saved groups config [deployment-charts] - https://gerrit.wikimedia.org/r/1305287 (https://phabricator.wikimedia.org/T429959)
[03:26:53] (PS2) Clare Ming: Remove saved groups config [deployment-charts] - https://gerrit.wikimedia.org/r/1305287 (https://phabricator.wikimedia.org/T429959)
[03:31:36] PROBLEM - Ensure traffic_manager is running for instance backend on cp6009 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:32:36] RECOVERY - Ensure traffic_manager is running for instance backend on cp6009 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:35:53] (PS1) Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1305288 (https://phabricator.wikimedia.org/T428984)
[03:38:01] (PS1) Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1305289 (https://phabricator.wikimedia.org/T428984)
[03:44:36] (PS3) Abijeet Patro: Enable ULS v2 on group2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1305290
[03:44:39] (CR) CI reject: [V:-1] Enable ULS v2 on group2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1305290 (owner: Abijeet Patro)
[03:46:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:55:32] (CR) Abijeet Patro: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305290 (owner: Abijeet Patro)
[03:55:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305290 (owner: Abijeet Patro)
[04:17:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[05:13:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis isvwiki in section s5
[05:13:43] Puppet, Release-Engineering-Team: registry-homepage-builder.py doesn't sort images as expected - https://phabricator.wikimedia.org/T388287#12048852 (hashar) The `build-homepage` service is indeed failing https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&from=now-5m&to=now&timezone=utc&var-...
[05:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:22:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis isvwiki in section s5
[05:25:52] (PS1) Ryan Kemper: opensearch: split plugins_mandatory into own key [puppet] - https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844)
[05:30:22] (PS1) Marostegui: mariadb: Check future m2-master [puppet] - https://gerrit.wikimedia.org/r/1305322 (https://phabricator.wikimedia.org/T429929)
[05:31:11] (CR) Marostegui: [C:+2] mariadb: Check future m2-master [puppet] - https://gerrit.wikimedia.org/r/1305322 (https://phabricator.wikimedia.org/T429929) (owner: Marostegui)
[05:32:02] (PS11) Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313)
[05:33:38] (CR) CI reject: [V:-1] dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: Trueg)
[05:35:30] (PS1) Marostegui: Revert "mariadb: Check future m2-master" [puppet] - https://gerrit.wikimedia.org/r/1305323
[05:35:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:36:37] (CR) Marostegui: [C:+2] Revert "mariadb: Check future m2-master" [puppet] - https://gerrit.wikimedia.org/r/1305323 (owner: Marostegui)
[05:41:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set es7 eqiad as read-only for maintenance - T429867', diff saved to https://phabricator.wikimedia.org/P94392 and previous config saved to /var/cache/conftool/dbconfig/20260624-054106-marostegui.json
[05:41:12] T429867: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T429867
[05:41:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Primary switchover es7 T429867
[05:41:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set es1039 with weight 0 T429867', diff saved to https://phabricator.wikimedia.org/P94393 and previous config saved to /var/cache/conftool/dbconfig/20260624-054131-marostegui.json
[05:42:11] (CR) Marostegui: [C:+2] mariadb: Promote es1039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1305020 (https://phabricator.wikimedia.org/T429867) (owner: Gerrit maintenance bot)
[05:44:24] !log Starting es7 eqiad failover from es1035 to es1039 - T429867
[05:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1039 to es7 primary T429867', diff saved to https://phabricator.wikimedia.org/P94394 and previous config saved to /var/cache/conftool/dbconfig/20260624-054446-marostegui.json
[05:44:55] ops-codfw, SRE, Data-Persistence, Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12048903 (ayounsi)
[05:45:09] (CR) Marostegui: [C:+2] wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1305021 (https://phabricator.wikimedia.org/T429867) (owner: Gerrit maintenance bot)
[05:45:16] !log marostegui@dns1004 START - running authdns-update
[05:45:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1035 T429867', diff saved to https://phabricator.wikimedia.org/P94395 and previous config saved to /var/cache/conftool/dbconfig/20260624-054547-marostegui.json
[05:46:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set es7 eqiad back to read-write - T429867', diff saved to https://phabricator.wikimedia.org/P94396 and previous config saved to /var/cache/conftool/dbconfig/20260624-054611-marostegui.json
[05:46:16] T429867: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T429867
[05:47:07] !log marostegui@dns1004 END - running authdns-update
[05:52:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:55:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:55:41] !log marostegui@cumin1003 dbmaint on es7@eqiad T429463
[05:55:47] T429463: Migrate es7 section to Debian Trixie - https://phabricator.wikimedia.org/T429463
[05:55:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1035: Upgrading es1035.eqiad.wmnet
[05:56:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1035: Upgrading es1035.eqiad.wmnet
[05:56:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:57:33] ops-codfw, SRE, Data-Persistence, Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12048932 (ayounsi)
[05:59:40] marostegui@cumin1003 major-upgrade (PID 2533770) is awaiting input
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T0600)
[06:04:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:07:59] marostegui@cumin1003 major-upgrade (PID 2533770) is awaiting input
[06:08:55] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1035.eqiad.wmnet with OS trixie
[06:21:51] (CR) Slyngshede: [C:+1] hiera: disable awslc on magru hosts [puppet] - https://gerrit.wikimedia.org/r/1305128 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[06:24:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage
[06:27:25] (CR) Jelto: [C:+2] gerrit: increase thresholds for GerritHigh4xxRatio alert [alerts] - https://gerrit.wikimedia.org/r/1304506 (https://phabricator.wikimedia.org/T428979) (owner: Jelto)
[06:29:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage
[06:30:08] (Merged) jenkins-bot: gerrit: increase thresholds for GerritHigh4xxRatio alert [alerts] - https://gerrit.wikimedia.org/r/1304506 (https://phabricator.wikimedia.org/T428979) (owner: Jelto)
[06:32:03] (PS1) Slyngshede: data.yaml: new expiry date for aramilferaxa [puppet] - https://gerrit.wikimedia.org/r/1305326
[06:35:02] (CR) Arnaudb: "I see! thanks for the review, lets leave that change aside for now then." [puppet] - https://gerrit.wikimedia.org/r/1302834 (https://phabricator.wikimedia.org/T420865) (owner: Arnaudb)
[06:36:44] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1305326 (owner: Slyngshede)
[06:42:08] (CR) Arnaudb: [C:+1] "thanks for adjusting the threshold!" [alerts] - https://gerrit.wikimedia.org/r/1304506 (https://phabricator.wikimedia.org/T428979) (owner: Jelto)
[06:45:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin2003.codfw.wmnet
[06:46:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1035.eqiad.wmnet with OS trixie
[06:51:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2003.codfw.wmnet
[06:53:06] (CR) Slyngshede: [C:+2] data.yaml: new expiry date for aramilferaxa [puppet] - https://gerrit.wikimedia.org/r/1305326 (owner: Slyngshede)
[06:53:38] !log jmm@cumin2003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[06:54:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1035: Migration of es1035.eqiad.wmnet completed
[06:54:21] !log jmm@cumin2003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2028.codfw.wmnet
[06:54:28] !log jmm@cumin2003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet
[06:54:48] !log jmm@cumin2003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2046.codfw.wmnet
[06:54:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:55:01] (PS1) Matthias Mullie: Enable MMV carousel on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1305329 (https://phabricator.wikimedia.org/T429509)
[06:55:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305329 (https://phabricator.wikimedia.org/T429509) (owner: Matthias Mullie)
[06:57:01] !log jmm@cumin2003 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-staging-etcd2001.codfw.wmnet to drbd
[06:57:27] (CR) TrainBranchBot: [C:+2] "Approved by mlitn@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305329 (https://phabricator.wikimedia.org/T429509) (owner: Matthias Mullie)
[06:59:03] Puppet, Release-Engineering-Team: registry-homepage-builder.py doesn't sort images as expected - https://phabricator.wikimedia.org/T388287#12049032 (elukey) ` Jun 24 06:24:07 registry2004 registry-homepage-builder[3522966]: INFO:root:Fetching the image catalog for localhost:5004 Jun 24 06:24:07 registry2...
[06:59:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:04] Amir1, urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T0700).
[07:00:04] matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:09] o/
[07:00:17] I've already begun
[07:00:20] (Merged) jenkins-bot: Enable MMV carousel on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1305329 (https://phabricator.wikimedia.org/T429509) (owner: Matthias Mullie)
[07:01:22] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1305329|Enable MMV carousel on enwiki (T429509)]]
[07:01:26] T429509: [Image Browsing] Carousel: Take the feature out of beta and set up a config variable to enable in production - https://phabricator.wikimedia.org/T429509
[07:03:54] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1305329|Enable MMV carousel on enwiki (T429509)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:04:47] !log mlitn@deploy1003 mlitn: Continuing with deployment
[07:05:57] (PS1) Elukey: docker_registry: improve homepage-builder.py's tag ordering [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287)
[07:06:29] (CR) CI reject: [V:-1] docker_registry: improve homepage-builder.py's tag ordering [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287) (owner: Elukey)
[07:07:04] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-staging-etcd2001.codfw.wmnet to drbd
[07:07:19] PROBLEM - Host ml-staging-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:43] (PS2) Elukey: docker_registry: improve homepage-builder.py's tag ordering [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287)
[07:07:47] RECOVERY - Host ml-staging-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.89 ms
[07:08:15] (CR) CI reject: [V:-1] docker_registry: improve homepage-builder.py's tag ordering [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287) (owner: Elukey)
[07:08:26] !log jmm@cumin2003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[07:08:38] (CR) Elukey: [C:+2] docker_registry: remove support for the nginx blob cache [puppet] - https://gerrit.wikimedia.org/r/1304512 (https://phabricator.wikimedia.org/T427175) (owner: Elukey)
[07:09:12] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305329|Enable MMV carousel on enwiki (T429509)]] (duration: 07m 49s)
[07:09:16] T429509: [Image Browsing] Carousel: Take the feature out of beta and set up a config variable to enable in production - https://phabricator.wikimedia.org/T429509
[07:09:30] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[07:09:37] Done; rest of backport window is up for grabs
[07:11:16] (PS1) Marostegui: db2202: Add note [puppet] - https://gerrit.wikimedia.org/r/1305331 (https://phabricator.wikimedia.org/T430017)
[07:11:32] !log jmm@cumin2003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[07:11:39] (PS2) Jelto: Update to v3.30.7 [debs/calico] (v3.30) - https://gerrit.wikimedia.org/r/1305139 (https://phabricator.wikimedia.org/T427400)
[07:11:39] (CR) Jelto: [V:+1] "build verified on `build2002`" [debs/calico] (v3.30) - https://gerrit.wikimedia.org/r/1305139 (https://phabricator.wikimedia.org/T427400) (owner: Jelto)
[07:13:07] (PS3) Elukey: docker_registry: improve homepage-builder.py's tag ordering [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287)
[07:15:19] (CR) Marostegui: "Can you do some testing around with test-cookbook just to make sure it is all working as expected without any major issues." [cookbooks] - https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) (owner: Federico Ceratto)
[07:16:21] (PS4) Hashar: docker_registry: improve homepage-builder.py's tag ordering [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287) (owner: Elukey)
[07:16:35] (CR) Hashar: [C:+1] "Great idea, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287) (owner: Elukey)
[07:21:17] (CR) Elukey: [C:+2] docker_registry: improve homepage-builder.py's tag ordering [puppet] - https://gerrit.wikimedia.org/r/1305330 (https://phabricator.wikimedia.org/T388287) (owner: Elukey)
[07:24:41] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:28:36] (CR) Elukey: "Applied suggestion thanks! The ipmi cookbook will go away soon, but yeah I'll update it right after changing spicerack!" [software/spicerack] - https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: Elukey)
[07:29:36] (PS5) Elukey: __init__: modify the management_password property [software/spicerack] - https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699)
[07:29:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:34:08] (CR) CI reject: [V:-1] __init__: modify the management_password property [software/spicerack] - https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: Elukey)
[07:39:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1035: Migration of es1035.eqiad.wmnet completed
[07:39:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[07:40:54] (CR) Muehlenhoff: [C:+2] Failover url-downloader.codfw CNAME to one of the new Trixie hosts [dns] - https://gerrit.wikimedia.org/r/1304764 (https://phabricator.wikimedia.org/T427282) (owner: Muehlenhoff)
[07:40:59] !log jmm@dns1004 START - running authdns-update
[07:41:12] (CR) Jcrespo: [C:+1] "Ok, I don't think it is a problem to use it as a critical host. The host won't be reclaimed intermediately and you could also "return a di" [puppet] - https://gerrit.wikimedia.org/r/1305331 (https://phabricator.wikimedia.org/T430017) (owner: Marostegui)
[07:42:51] !log jmm@dns1004 END - running authdns-update
[07:44:33] (CR) Marostegui: [C:+2] db2202: Add note [puppet] - https://gerrit.wikimedia.org/r/1305331 (https://phabricator.wikimedia.org/T430017) (owner: Marostegui)
[07:45:25] (PS1) Dpogorzelski: ml-serve: temperature/power and partition usage [puppet] - https://gerrit.wikimedia.org/r/1305336 (https://phabricator.wikimedia.org/T403697)
[07:46:21] (PS2) Dpogorzelski: ml-serve: temperature/power and partition usage [puppet] - https://gerrit.wikimedia.org/r/1305336 (https://phabricator.wikimedia.org/T403697)
[07:47:33] (PS3) Dpogorzelski: ml-serve: temperature/power and partition usage [puppet] - https://gerrit.wikimedia.org/r/1305336 (https://phabricator.wikimedia.org/T403697)
[07:48:38] (PS4) Dpogorzelski: ml-serve: temperature/power and partition usage [puppet] - https://gerrit.wikimedia.org/r/1305336 (https://phabricator.wikimedia.org/T403697)
[07:50:33] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[07:50:33] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893
[07:50:39] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893
[07:50:53] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1199: Upgrading db1199.eqiad.wmnet
[07:51:48] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:52:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:53:14] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1199: Upgrading db1199.eqiad.wmnet
[07:54:44] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[07:54:44] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893
[07:55:06] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2179: Upgrading db2179.codfw.wmnet
[07:55:17] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:55:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2179: Upgrading db2179.codfw.wmnet
[07:56:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1199.eqiad.wmnet with OS trixie
[07:57:55] (PS1) Kevin Bazira: ml: assemble venv in build stage and chunk runtime layers to fit registry limit [docker-images/production-images] - https://gerrit.wikimedia.org/r/1305341 (https://phabricator.wikimedia.org/T429667)
[07:58:10] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2179.codfw.wmnet with OS trixie
[08:01:22] (PS1) Brouberol: Fix prometheus for kafka monitoring, to fix linting alert [alerts] - https://gerrit.wikimedia.org/r/1305342 (https://phabricator.wikimedia.org/T429127)
[08:03:03] (PS2) Kevin Bazira: ml: assemble venv in build stage and chunk runtime layers to fit registry limit [docker-images/production-images] - https://gerrit.wikimedia.org/r/1305341 (https://phabricator.wikimedia.org/T429667)
[08:05:39] (CR) Brouberol: [C:+2] Fix prometheus for kafka monitoring, to fix linting alert [alerts] - https://gerrit.wikimedia.org/r/1305342 (https://phabricator.wikimedia.org/T429127) (owner: Brouberol)
[08:06:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:07:31] !log depooling cp7001 and cp7009 to reimage (T419825)
[08:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:36] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825
[08:08:48] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7001.*
[08:08:56] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7001.*
[08:09:07] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7009.*
[08:09:27] (CR) Fabfur: [C:+2] hiera: disable awslc on magru hosts [puppet] - https://gerrit.wikimedia.org/r/1305128 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[08:10:49] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage
[08:11:36] (PS1) Muehlenhoff: Update redis-misc-canary alias [puppet] - https://gerrit.wikimedia.org/r/1305344
[08:13:36] (CR) Muehlenhoff: [C:+2] Move the the hourly httpbb run to cumin2003 [puppet] - https://gerrit.wikimedia.org/r/1304803 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[08:13:47] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp7001.magru.wmnet with OS trixie
[08:13:57] SRE-swift-storage, Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#12049171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1003 for host cp7001.magru.wmnet with OS trixie
[08:14:35] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage
[08:16:31] (PS5) Ayounsi: netbox: add a BGP getter/setter [software/spicerack] - https://gerrit.wikimedia.org/r/1304554
[08:16:31] (PS1) Ayounsi: tox: add python 3.14 [software/spicerack] - https://gerrit.wikimedia.org/r/1305345
[08:16:56] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2179.codfw.wmnet with reason: host reimage
[08:17:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:17:34] fabfur: okay to merge your "hiera: disable awslc on magru hosts" patch along?
[08:18:40] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s1
[08:19:48] moritzm: sorry forgot it, thanks
[08:20:28] ok
[08:21:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1013.eqiad.wmnet with reason: Cloning cloddb1026
[08:24:40] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2179.codfw.wmnet with reason: host reimage
[08:25:47] (PS2) Tiziano Fogli: redis: remove nrpe check [puppet] - https://gerrit.wikimedia.org/r/1305075 (https://phabricator.wikimedia.org/T384924) (owner: Hnowlan)
[08:25:47] (PS1) Tiziano Fogli: redis: disable nrpe checks, replace with prometheus checks [puppet] - https://gerrit.wikimedia.org/r/1305347 (https://phabricator.wikimedia.org/T384924)
[08:26:19] (PS1) Marostegui: mariadb: Productionize cloudb1026 [puppet] - https://gerrit.wikimedia.org/r/1305349 (https://phabricator.wikimedia.org/T409557)
[08:27:40] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp7009.magru.wmnet with OS trixie
[08:29:39] (CR) Tiziano Fogli: "I just added a commit to disable the Icinga check before deleting it from the configuration." [puppet] - https://gerrit.wikimedia.org/r/1305347 (https://phabricator.wikimedia.org/T384924) (owner: Tiziano Fogli)
[08:31:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1199.eqiad.wmnet with OS trixie
[08:32:14] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:33:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:34:19] (CR) Marostegui: [C:+2] mariadb: Productionize cloudb1026 [puppet] - https://gerrit.wikimedia.org/r/1305349 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[08:35:55] (PS1) Ayounsi: Add depool policy for VTRS [puppet] - https://gerrit.wikimedia.org/r/1305350 (https://phabricator.wikimedia.org/T327300)
[08:37:58] (CR) Tiziano Fogli: [C:+1] redis: migrate icinga checks to prometheus [alerts] - https://gerrit.wikimedia.org/r/1305072 (https://phabricator.wikimedia.org/T384924) (owner: Hnowlan)
[08:38:15] (CR) Tiziano Fogli: [C:+1] redis: remove nrpe check [puppet] - https://gerrit.wikimedia.org/r/1305075 (https://phabricator.wikimedia.org/T384924) (owner: Hnowlan)
[08:38:37] !log fabfur@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7001.magru.wmnet with reason: host reimage
[08:40:42] SRE, SRE-Access-Requests: Requesting access to Analytics Production Access for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T429896#12049245 (MoritzMuehlenhoff)
[08:40:58] SRE, SRE-Access-Requests: Requesting access to Analytics Production Access for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T429896#12049248 (MoritzMuehlenhoff) @Gehel This needs your approval for analytics-wmde-users
[08:42:04] (PS1) Muehlenhoff: Add nicholusmuwonge to analytics-wmde-users [puppet] - https://gerrit.wikimedia.org/r/1305352 (https://phabricator.wikimedia.org/T429896)
[08:43:24] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2179.codfw.wmnet with OS trixie
[08:43:45] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7001.magru.wmnet with reason: host reimage
[08:46:41] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1199: Migration of db1199.eqiad.wmnet completed
[08:48:51] (CR) Tiziano Fogli: [C:+1] "LGTM from an o11y perspective. I'll leave the specifics to the team." [alerts] - https://gerrit.wikimedia.org/r/1300745 (https://phabricator.wikimedia.org/T428873) (owner: Filippo Giunchedi)
[08:48:59] (CR) Tiziano Fogli: [C:+1] "LGTM from an o11y perspective. I'll leave the specifics to the team." [alerts] - https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[08:49:40] (CR) Ayounsi: diffscan: pyhotnify (2 comments) [puppet] - https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: Jbond)
[08:50:02] !log fabfur@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7009.magru.wmnet with reason: host reimage
[08:50:06] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[08:50:09] (PS20) Ayounsi: diffscan: pyhotnify [puppet] - https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: Jbond)
[08:53:56] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7009.magru.wmnet with reason: host reimage
[08:56:32] (CR) Ayounsi: Cookbook to configure switch port vlans for cloud hosts (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1303397 (https://phabricator.wikimedia.org/T429466) (owner: Cathal Mooney)
[08:56:40] (Abandoned) Blake: mw-wikifunctions:
Prune host list for mw-wikifunctions ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301313 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [08:58:25] (03CR) 10Cathal Mooney: Cookbook to configure switch port vlans for cloud hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1303397 (https://phabricator.wikimedia.org/T429466) (owner: 10Cathal Mooney) [08:59:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2179: Migration of db2179.codfw.wmnet completed [09:00:26] (03PS1) 10Muehlenhoff: Failover url-downloader.eqiad CNAME to one of the new Trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1305354 (https://phabricator.wikimedia.org/T427282) [09:01:14] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Analytics Production Access for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T429896#12049282 (10Gehel) Approved [09:02:27] (03CR) 10Klausman: [C:03+1] ml-serve: temperature/power and partition usage [puppet] - 10https://gerrit.wikimedia.org/r/1305336 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski) [09:02:32] (03PS6) 10Elukey: __init__: modify the management_password property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) [09:03:12] !log temporarily remove ganeti2028 from the codfw cluster T429817 [09:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:17] T429817: codfw: rack A7 maintenance - https://phabricator.wikimedia.org/T429817 [09:04:50] (03CR) 10Slyngshede: [C:03+1] Add laurabarluzzi to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1304751 (https://phabricator.wikimedia.org/T429431) (owner: 10Muehlenhoff) [09:05:28] PROBLEM - ganeti-noded running on ganeti2028 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:05:28] PROBLEM - 
ganeti-confd running on ganeti2028 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:05:59] (03CR) 10Slyngshede: [C:03+1] Add nicholusmuwonge to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1305352 (https://phabricator.wikimedia.org/T429896) (owner: 10Muehlenhoff) [09:06:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Analytics Production Access for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T429896#12049302 (10SLyngshede-WMF) [09:06:50] FIRING: ProbeDown: Service ganeti2028:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:12] (03CR) 10Klausman: [V:03+2 C:03+2] ml: assemble venv in build stage and chunk runtime layers to fit registry limit [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305341 (https://phabricator.wikimedia.org/T429667) (owner: 10Kevin Bazira) [09:07:32] (03CR) 10CI reject: [V:04-1] __init__: modify the management_password property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [09:07:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:08:07] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:08:34] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7001.magru.wmnet with OS trixie [09:08:43] 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - 
https://phabricator.wikimedia.org/T352744#12049314 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1003 for host cp7001.magru.wmnet with OS trixie completed: - cp7001 (**PASS**) - Downtimed on Icinga/Alertmana... [09:12:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:12:47] elukey@cumin1003 provision (PID 2562859) is awaiting input [09:13:30] (03CR) 10Blake: [C:03+2] main: Add a namespace for the mw-pretrain service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304083 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [09:15:42] (03CR) 10Elukey: tox: add python 3.14 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 (owner: 10Ayounsi) [09:16:02] (03CR) 10Muehlenhoff: [C:03+2] Add laurabarluzzi to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1304751 (https://phabricator.wikimedia.org/T429431) (owner: 10Muehlenhoff) [09:18:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for laurabarluzzi - https://phabricator.wikimedia.org/T429431#12049348 (10MoritzMuehlenhoff) 05In progress→03Resolved a:05XenoRyet→03MoritzMuehlenhoff @Laurabarluzzi Your access has been enabled, it... 
[09:18:54] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7009.magru.wmnet with OS trixie [09:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:20:14] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: temperature/power and partition usage [puppet] - 10https://gerrit.wikimedia.org/r/1305336 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski) [09:20:18] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:21:05] (03CR) 10Elukey: [C:03+1] netbox: add a BGP getter/setter [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304554 (owner: 10Ayounsi) [09:22:06] !log jmm@cumin2003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet [09:22:13] (03Merged) 10jenkins-bot: main: Add a namespace for the mw-pretrain service. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1304083 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [09:22:57] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2046.codfw.wmnet [09:22:59] (03CR) 10Ayounsi: tox: add python 3.14 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 (owner: 10Ayounsi) [09:24:28] (03CR) 10Elukey: tox: add python 3.14 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 (owner: 10Ayounsi) [09:25:39] (03PS7) 10Elukey: __init__: modify the management_password property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) [09:25:39] (03PS1) 10Elukey: Add setuptools to sphinx's tox environment [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305355 [09:25:46] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:26:18] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:26:47] !log repooling cp7001 and cp7009 after reimage (T419825) [09:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:51] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [09:27:09] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.* [09:27:13] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7001.* [09:27:23] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp7001.magru.wmnet [09:27:24] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7001.magru.wmnet [09:27:30] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp7009.magru.wmnet [09:27:30] !log fabfur@cumin1003 END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7009.magru.wmnet [09:28:00] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:28:00] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:28:31] (03CR) 10Muehlenhoff: [C:03+2] Add nicholusmuwonge to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1305352 (https://phabricator.wikimedia.org/T429896) (owner: 10Muehlenhoff) [09:28:38] (03PS1) 10Jcrespo: versitygw: Fix service not reloading after certificate change [puppet] - 10https://gerrit.wikimedia.org/r/1305356 (https://phabricator.wikimedia.org/T430023) [09:29:31] (03PS2) 10Jcrespo: versitygw: Fix service not reloading after certificate change [puppet] - 10https://gerrit.wikimedia.org/r/1305356 (https://phabricator.wikimedia.org/T430023) [09:29:39] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305356 (https://phabricator.wikimedia.org/T430023) (owner: 10Jcrespo) [09:29:53] (03PS1) 10Elukey: docker_registry: fix homepage-builder.py code [puppet] - 10https://gerrit.wikimedia.org/r/1305357 (https://phabricator.wikimedia.org/T388287) [09:30:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Analytics Production Access for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T429896#12049392 (10MoritzMuehlenhoff) 05Open→03Resolved a:05Gehel→03MoritzMuehlenhoff @Nicholusmuwonge_wmde Your access has been enable... 
[09:31:22] jouncebot: nowandnext [09:31:22] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [09:31:23] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1000) [09:31:32] OK, I'll push some config out now [09:32:13] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1199: Migration of db1199.eqiad.wmnet completed [09:32:14] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [09:32:28] (03CR) 10Jforrester: [C:03+2] [testwiki] Enable Abstract Client integration mode, not just previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304800 (https://phabricator.wikimedia.org/T422657) (owner: 10Jforrester) [09:32:35] (03CR) 10Jforrester: [C:03+2] [abstractwiki] Add the 'allowed' temporary vars for cross-wiki content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304770 (https://phabricator.wikimedia.org/T422657) (owner: 10Jforrester) [09:32:59] (03CR) 10Jforrester: [C:03+2] [abstractwiki] Update favicon with new version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304110 (https://phabricator.wikimedia.org/T429620) (owner: 10Jforrester) [09:33:05] (03CR) 10Jforrester: [C:03+2] WikiLambda: Expose wikilambda-abstract-optin for global group assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305182 (https://phabricator.wikimedia.org/T422698) (owner: 10Jforrester) [09:33:25] (03Merged) 10jenkins-bot: [testwiki] Enable Abstract Client integration mode, not just previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304800 (https://phabricator.wikimedia.org/T422657) (owner: 10Jforrester) [09:33:32] (03Merged) 10jenkins-bot: [abstractwiki] Add the 'allowed' temporary vars for cross-wiki content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304770 (https://phabricator.wikimedia.org/T422657) (owner: 10Jforrester) [09:33:53] (03CR) 10Jcrespo: "Hi, 
Moritz, this looks like a silly mistake (missing notification) I made when setting up this service in a hurry. A quick sanity check wo" [puppet] - 10https://gerrit.wikimedia.org/r/1305356 (https://phabricator.wikimedia.org/T430023) (owner: 10Jcrespo) [09:33:55] (03Merged) 10jenkins-bot: [abstractwiki] Update favicon with new version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304110 (https://phabricator.wikimedia.org/T429620) (owner: 10Jforrester) [09:34:00] (03Merged) 10jenkins-bot: WikiLambda: Expose wikilambda-abstract-optin for global group assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305182 (https://phabricator.wikimedia.org/T422698) (owner: 10Jforrester) [09:34:42] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1304800|[testwiki] Enable Abstract Client integration mode, not just previews (T422657)]], [[gerrit:1304770|[abstractwiki] Add the 'allowed' temporary vars for cross-wiki content (T422657)]], [[gerrit:1305182|WikiLambda: Expose wikilambda-abstract-optin for global group assignment (T422698)]], [[gerrit:1304110|[abstractwiki] Update favicon with new [09:34:42] version (T429620)]] [09:34:48] T422657: Enable abstract client mode on Test Wikipedia - https://phabricator.wikimedia.org/T422657 [09:34:48] T422698: Grant `wikilambda-abstract-optin` to cross-wiki global groups via mediawiki-config / stewards' config on Special:GlobalGroupPermissions - https://phabricator.wikimedia.org/T422698 [09:34:49] T429620: Fix Abstract Wikipedia favicon - https://phabricator.wikimedia.org/T429620 [09:36:03] (03CR) 10Elukey: [C:03+2] docker_registry: fix homepage-builder.py code [puppet] - 10https://gerrit.wikimedia.org/r/1305357 (https://phabricator.wikimedia.org/T388287) (owner: 10Elukey) [09:36:46] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1304800|[testwiki] Enable Abstract Client integration mode, not just previews (T422657)]], [[gerrit:1304770|[abstractwiki] Add the 'allowed' temporary vars 
for cross-wiki content (T422657)]], [[gerrit:1305182|WikiLambda: Expose wikilambda-abstract-optin for global group assignment (T422698)]], [[gerrit:1304110|[abstractwiki] Update favicon with new version (T429 [09:36:46] 620)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:36:54] T429: Analyze MW-Vagrant qualitative survey - https://phabricator.wikimedia.org/T429 [09:37:46] !log jforrester@deploy1003 jforrester: Continuing with deployment [09:38:37] (03CR) 10Elukey: "@rcoccioli@wikimedia.org ready!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [09:38:53] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:40:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049442 (10jcrespo) I've stopped ms-backups2003 network operations for now, codfw media backups will continue to flow temporarily only through ms-backup2004. No hurry on h... 
[09:42:05] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1304800|[testwiki] Enable Abstract Client integration mode, not just previews (T422657)]], [[gerrit:1304770|[abstractwiki] Add the 'allowed' temporary vars for cross-wiki content (T422657)]], [[gerrit:1305182|WikiLambda: Expose wikilambda-abstract-optin for global group assignment (T422698)]], [[gerrit:1304110|[abstractwiki] Update favicon with new [09:42:05] version (T429620)]] (duration: 07m 23s) [09:42:13] T422657: Enable abstract client mode on Test Wikipedia - https://phabricator.wikimedia.org/T422657 [09:42:14] T422698: Grant `wikilambda-abstract-optin` to cross-wiki global groups via mediawiki-config / stewards' config on Special:GlobalGroupPermissions - https://phabricator.wikimedia.org/T422698 [09:42:14] T429620: Fix Abstract Wikipedia favicon - https://phabricator.wikimedia.org/T429620 [09:42:29] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:43:17] (03CR) 10Aklapper: "Good point. This was overwritten anyway in https://gitlab.wikimedia.org/repos/phabricator/deployment/-/blob/wmf/stable/scap/templates/phab" [puppet] - 10https://gerrit.wikimedia.org/r/1305041 (https://phabricator.wikimedia.org/T330797) (owner: 10Aklapper) [09:43:36] (03PS2) 10Blake: kubernetes: Add a k8s deployment for pretrain. [puppet] - 10https://gerrit.wikimedia.org/r/1305358 (https://phabricator.wikimedia.org/T427668) [09:45:30] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2179: Migration of db2179.codfw.wmnet completed [09:45:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [09:47:18] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy outlink model latest version on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305056 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [09:49:21] (03PS12) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [09:49:34] (03Merged) 10jenkins-bot: ml-services: Deploy outlink model latest version on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305056 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [09:51:50] RESOLVED: ProbeDown: Service ganeti2028:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:52:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:17] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:54:27] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[09:54:44] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305355 (owner: 10Elukey) [09:55:12] (03PS1) 10Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1305363 [09:57:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:58:13] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1 [09:58:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1305356 (https://phabricator.wikimedia.org/T430023) (owner: 10Jcrespo) [09:59:52] (03PS1) 10Marostegui: check_private_data_report: Add clouddb1026 [puppet] - 10https://gerrit.wikimedia.org/r/1305365 (https://phabricator.wikimedia.org/T409557) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1000) [10:00:08] (03CR) 10Volans: "LGTM, one lost thing in rebase ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:01:12] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Add clouddb1026 [puppet] - 10https://gerrit.wikimedia.org/r/1305365 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [10:01:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: 10Jbond) [10:04:09] (03CR) 10Fabfur: [C:03+2] hiera: disable awslc on codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1305131 (https://phabricator.wikimedia.org/T419825) 
(owner: 10Fabfur) [10:05:04] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp2043.* [10:05:08] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp2044.* [10:05:24] !log depooling cp2043 and cp2044 to reimage (T419825) [10:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:27] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [10:08:27] (03PS4) 10Abijeet Patro: Enable ULS v2 by default across all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305290 [10:10:33] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp2044.codfw.wmnet with OS trixie [10:10:35] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS trixie [10:11:01] (03CR) 10Muehlenhoff: [C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1305363 (owner: 10Muehlenhoff) [10:13:45] (03CR) 10Muehlenhoff: docker_registry: remove support for the nginx blob cache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1304512 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [10:14:12] (03CR) 10Marostegui: cookbooks/sre/mysql/decommission: add cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [10:17:24] (03PS17) 10Federico Ceratto: mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) [10:18:13] (03CR) 10Clément Goubert: [C:03+1] Failover url-downloader.eqiad CNAME to one of the new Trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1305354 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [10:19:01] (03PS2) 10Zabe: Use Hadoop for Mostcategories on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248909 
(https://phabricator.wikimedia.org/T413362) [10:19:27] (03PS1) 10Marostegui: installserver: Do not format db1290 [puppet] - 10https://gerrit.wikimedia.org/r/1305369 [10:22:42] (03CR) 10Marostegui: [C:03+2] installserver: Do not format db1290 [puppet] - 10https://gerrit.wikimedia.org/r/1305369 (owner: 10Marostegui) [10:25:42] (03PS1) 10Marostegui: eqiad.yaml: Add clouddb1026 [puppet] - 10https://gerrit.wikimedia.org/r/1305373 (https://phabricator.wikimedia.org/T409557) [10:25:51] (03PS1) 10Muehlenhoff: thumbor: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305374 [10:26:16] !log fabfur@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [10:26:17] !log fabfur@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2043.codfw.wmnet with reason: host reimage [10:29:00] (03PS3) 10Btullis: presto: Test resource groups and spill features on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) [10:29:00] (03PS3) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [10:30:17] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [10:30:40] (03PS7) 10Gkyziridis: ml-services: Deploy Qwen3.6 model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305071 (https://phabricator.wikimedia.org/T425680) [10:34:12] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2043.codfw.wmnet with reason: host reimage [10:39:02] (03PS1) 10Blake: kube-state-metrics: Add v2.18.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305377 (https://phabricator.wikimedia.org/T427405) [10:42:59] jouncebot: nowandnext [10:42:59] For the next 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1000) [10:42:59] In 0 hour(s) and 17 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1100) [10:43:35] I'm locking scap because I need to test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1304845 on beta, so I need to +2 it and merge, but I don't want it to get pulled just now [10:43:47] Please ping me if that's a problem for you [10:45:03] !log cgoubert@deploy1003 Locking from deployment [ALL REPOSITORIES]: Testing apiportalwiki deletion in beta - T418494 [10:45:08] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [10:45:09] (03CR) 10Clément Goubert: [C:03+2] Remove config related to the API Portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304845 (https://phabricator.wikimedia.org/T429372) (owner: 10Alex Paskulin) [10:46:39] (03PS3) 10Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) [10:49:34] (03CR) 10MSantos: [C:03+2] Publish public PGP key of Yiannis Giannelos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305151 (https://phabricator.wikimedia.org/T423255) (owner: 10Jgiannelos) [10:49:59] (03CR) 10MSantos: [C:03+2] mediawiki.org keys.html: Limit height of key code blocks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1305200 (owner: 10Bartosz Dziewoński) [10:50:41] mbsantos: I just locked scap btw. [10:50:52] (03Abandoned) 10Jforrester: ExecuteTestAndCacheJob: Don't explode when there are no connected Implementations/Tests [extensions/WikiLambda] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1304563 (https://phabricator.wikimedia.org/T429460) (owner: 10Jforrester) [10:51:22] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2044.codfw.wmnet with OS trixie [10:52:14] (03Merged) 10jenkins-bot: Remove config related to the API Portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304845 (https://phabricator.wikimedia.org/T429372) (owner: 10Alex Paskulin) [10:52:20] (03CR) 10FNegri: [C:03+1] eqiad.yaml: Add clouddb1026 [puppet] - 10https://gerrit.wikimedia.org/r/1305373 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [10:52:46] (03CR) 10Marostegui: [C:03+2] eqiad.yaml: Add clouddb1026 [puppet] - 10https://gerrit.wikimedia.org/r/1305373 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [10:53:16] (03Merged) 10jenkins-bot: Publish public PGP key of Yiannis Giannelos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305151 (https://phabricator.wikimedia.org/T423255) (owner: 10Jgiannelos) [10:53:19] (03Merged) 10jenkins-bot: mediawiki.org keys.html: Limit height of key code blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305200 (owner: 10Bartosz Dziewoński) [10:53:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302201 (owner: 10MSantos) [10:55:33] (03CR) 10Atsuko: presto: Test resource groups and spill features on the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [10:57:01] 
!log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2043.codfw.wmnet with OS trixie [10:57:03] (03PS1) 10Kosta Harlan: CheckUserGetUsersPager: Fix TypeError for numeric usernames [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305378 (https://phabricator.wikimedia.org/T429971) [10:57:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305378 (https://phabricator.wikimedia.org/T429971) (owner: 10Kosta Harlan) [10:59:02] (03PS1) 10Clément Goubert: CommonSettings-labs: Remove api.wikimedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305380 (https://phabricator.wikimedia.org/T429372) [11:00:05] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1100). Please do the needful. 
[11:01:22] (03CR) 10Trueg: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [11:01:40] (03CR) 10Zabe: [C:03+1] CommonSettings-labs: Remove api.wikimedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305380 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [11:01:41] Sooo now https://api.wikimedia.beta.wmcloud.org/wiki/Main_Page# is completely broken but somehow still exists even though it's in the deleted.dblist [11:01:45] cool cool [11:01:55] (03CR) 10Clément Goubert: [C:03+2] CommonSettings-labs: Remove api.wikimedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305380 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [11:02:25] (03PS12) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [11:02:53] (03Merged) 10jenkins-bot: CommonSettings-labs: Remove api.wikimedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305380 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [11:05:24] (03PS18) 10Federico Ceratto: mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) [11:07:17] (03CR) 10Federico Ceratto: mysql: update replication source (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [11:09:57] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305374 (owner: 10Muehlenhoff) [11:10:52] (03PS1) 10Gkyziridis: ml-services: Deploy artest version of ticle-country model on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305382 (https://phabricator.wikimedia.org/T429675) [11:11:19] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [11:11:26] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:20:27] (03CR) 10Ozge: [C:03+1] ml-services: Deploy artest version of ticle-country model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305382 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [11:22:26] (03PS1) 10Gkyziridis: ml-services: Bump revscoring staging images to 2026-06-23-094330-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305384 (https://phabricator.wikimedia.org/T429675) [11:23:04] (03PS1) 10Clément Goubert: beta: remove api.wikimedia.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1305385 (https://phabricator.wikimedia.org/T429372) [11:23:18] (03CR) 10Gkyziridis: "I am not sure if it would be more wise to split it in separated deployments." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305384 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [11:23:26] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy artest version of ticle-country model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305382 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [11:25:37] (03Merged) 10jenkins-bot: ml-services: Deploy artest version of ticle-country model on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305382 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [11:27:01] (03CR) 10Btullis: [C:03+2] Grant sudo privileges for the analytics-fr-tech-users group [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [11:27:14] (03CR) 10Zabe: [C:03+1] beta: remove api.wikimedia.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1305385 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [11:27:35] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:27:36] (03CR) 10Clément Goubert: [C:03+2] beta: remove api.wikimedia.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1305385 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [11:27:57] (03PS3) 10Hnowlan: redis: remove nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/1305075 (https://phabricator.wikimedia.org/T384924) [11:28:08] btullis: if you catch my change you can merge it [11:28:30] jouncebot: nowandnext [11:28:30] For the next 0 hour(s) and 31 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1100) [11:28:30] In 1 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1300) [11:28:39] hnowlan: scap is locked atm [11:28:47] I'm testing mediawiki-config stuff in beta [11:28:52] ack, nbd [11:28:57] I'm gonna roll out a new check for redis [11:29:01] ack [11:29:06] so won't conflict [11:29:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:29:54] (03CR) 10Hnowlan: [C:03+2] redis: migrate icinga checks 
to prometheus [alerts] - 10https://gerrit.wikimedia.org/r/1305072 (https://phabricator.wikimedia.org/T384924) (owner: 10Hnowlan) [11:32:00] (03Merged) 10jenkins-bot: redis: migrate icinga checks to prometheus [alerts] - 10https://gerrit.wikimedia.org/r/1305072 (https://phabricator.wikimedia.org/T384924) (owner: 10Hnowlan) [11:33:39] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:34:08] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:35:45] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T429116#12049836 (10Jclark-ctr) I swapped the optic on Switch B2 and also replaced the cable between the core router and the ToR switch. New Cable ID: G2210253241007206. [11:36:38] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:37:40] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304652 (owner: 10PipelineBot) [11:39:41] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:39:55] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304652 (owner: 10PipelineBot) [11:41:04] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T429116#12049844 (10Jclark-ctr) I decided to swap the cable after noticing that the interface errors had increased quite a bit over the past few days. Since the new switch was recent... 
[11:42:34] (03CR) 10Muehlenhoff: [C:03+2] Failover url-downloader.eqiad CNAME to one of the new Trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1305354 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [11:42:41] !log jmm@dns1004 START - running authdns-update [11:44:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:44:46] !log jmm@dns1004 END - running authdns-update [11:46:21] (03CR) 10Jcrespo: [C:03+2] versitygw: Fix service not reloading after certificate change [puppet] - 10https://gerrit.wikimedia.org/r/1305356 (https://phabricator.wikimedia.org/T430023) (owner: 10Jcrespo) [11:46:38] !log jmm@dns1004 START - running authdns-update [11:46:50] jouncebot: nowandnext [11:46:50] For the next 0 hour(s) and 13 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1100) [11:46:50] In 1 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1300) [11:47:14] Dreamy_Jazz: scap is locked [11:47:19] Thanks [11:47:21] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:47:38] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:48:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host db1208.eqiad.wmnet [11:48:27] !log jmm@dns1004 END - running authdns-update [11:49:28] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1026.eqiad.wmnet,service=s1 [11:49:41] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:50:21] !log marostegui@cumin1003 conftool action : set/weight=100; selector: name=clouddb1026.eqiad.wmnet [11:51:27] (03CR) 10Kosta Harlan: [C:03+1] hCaptcha: Enable for Special:Contact [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304919 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [11:52:04] Dreamy_Jazz: You may know, when running populateHomeDB.php, what wiki should I pass to mwscript? [11:52:39] Not sure, I haven't used that script before [11:52:46] Ugh [11:53:15] (03PS1) 10Muehlenhoff: profile::server_depool: Mark ganeti/test as fine to ignore [puppet] - 10https://gerrit.wikimedia.org/r/1305387 (https://phabricator.wikimedia.org/T327300) [11:53:43] !log installing postgresql security updates [11:53:46] !log ayounsi@cumin1003 START - Cookbook sre.network.depool-rack with action 'depool' for codfw rack A6 [11:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:02] ok metawiki seems to be the play [11:56:46] ayounsi@cumin1003 depool-rack (PID 2584950) is awaiting input [11:57:00] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305071 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:57:02] Is anything alerting for citoid? 
I only deployed to staging and the SLO is going crazy [11:57:14] https://grafana.wikimedia.org/goto/afq3lqap3bdhcd?orgId=1 [11:58:03] (03PS1) 10Marostegui: clouddb1026: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1305388 [11:58:46] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1305388 (owner: 10Marostegui) [11:58:49] (03CR) 10Marostegui: [C:03+2] clouddb1026: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1305388 (owner: 10Marostegui) [11:59:14] claime: halp :( [11:59:53] Mvolz: taking a look [12:00:10] ty [12:00:42] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy Qwen3.6 model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305071 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [12:01:45] Mvolz: do you see anything in logstash? [12:01:48] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 23 hosts with reason: Switch maintenance [12:01:50] pods seem ok [12:02:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049887 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b3768ce5-4982-4cdb-ac8d-3735e9e5290b) set by ayounsi@cumin1003 for 2:00:00 on 23 host(s) and th... 
[12:02:16] (03PS1) 10Muehlenhoff: Revert "Failover url-downloader.eqiad CNAME to one of the new Trixie hosts" [dns] - 10https://gerrit.wikimedia.org/r/1305389 [12:02:47] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lsw1-a6-codfw,lsw1-a6-codfw IPv6,lsw1-a6-codfw.mgmt with reason: Switch maintenance [12:02:53] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.depool-rack (exit_code=99) with action 'depool' for codfw rack A6 [12:02:57] claime: 100% of our outgoing response codes are 504s [12:02:58] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049888 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ac4b8e18-e54c-4708-804e-e3c84d435ded) set by ayounsi@cumin1003 for 2:00:00 on 3 host(s) and the... [12:03:04] I think this is related to url-downloader maybe [12:03:07] (03Merged) 10jenkins-bot: ml-services: Deploy Qwen3.6 model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305071 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [12:03:13] Mvolz: there's a zotero pod misbehaving as well [12:03:22] !log jmm@dns1004 START - running authdns-update [12:03:45] moritzm: in #wikimedia-serviceops is reverting a url-downloader thing I think [12:03:50] Ah, may be related to moritzm work on url-downloader [12:03:52] yeah [12:03:55] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2155: rack depool [12:03:56] ty [12:04:15] url-downloader issues have caused hCaptcha requests to drop btw [12:04:15] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2155: rack depool [12:04:27] *specifically those made to the siteverify API from our servers [12:04:29] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2156: rack depool [12:04:33] So hCaptcha has gone into failover [12:04:34] what work is being done on urldownloader? 
[12:05:00] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2156: rack depool [12:05:10] (The failover state is not ideal as it's not really working) [12:05:11] !log jmm@dns1004 END - running authdns-update [12:05:24] (03PS1) 10Muehlenhoff: Revert "Failover url-downloader.codfw CNAME to one of the new Trixie hosts" [dns] - 10https://gerrit.wikimedia.org/r/1305390 [12:05:56] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2067,2072-2073,2114-2115,2124-2127,2256-2257].codfw.wmnet [12:06:21] !log T423993: closing ttmserver indices in the cirrussearch opensearch cluster (eqiad & codfw) [12:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:26] T423993: Upgrade old indices in the CirrusSearch opensearch clusters - https://phabricator.wikimedia.org/T423993 [12:06:43] (03CR) 10Muehlenhoff: [C:03+2] Revert "Failover url-downloader.codfw CNAME to one of the new Trixie hosts" [dns] - 10https://gerrit.wikimedia.org/r/1305390 (owner: 10Muehlenhoff) [12:07:15] (03CR) 10Muehlenhoff: [C:03+2] Revert "Failover url-downloader.eqiad CNAME to one of the new Trixie hosts" [dns] - 10https://gerrit.wikimedia.org/r/1305389 (owner: 10Muehlenhoff) [12:07:20] !log jmm@dns1004 START - running authdns-update [12:07:22] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [12:07:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [12:07:38] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-staging2001.codfw.wmnet [12:09:16] !log jmm@dns1004 END - running authdns-update [12:09:27] !log jmm@dns1004 START - running authdns-update [12:09:32] (03CR) 10Jelto: [V:03+1] "@mmuhlenhoff@wikimedia.org how can we move forward here? 
Do you think we can test this approach for some of the collab hosts (etherpad, gi" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [12:10:07] Ok the populateHomeDB.php script takes ages on beta I don't think it's been run forever [12:10:32] (03PS8) 10Elukey: __init__: modify the management_password property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) [12:10:36] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:11:21] !log jmm@dns1004 END - running authdns-update [12:11:22] (03CR) 10Elukey: __init__: modify the management_password property (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [12:11:34] (03CR) 10Elukey: [C:03+2] Add setuptools to sphinx's tox environment [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305355 (owner: 10Elukey) [12:12:53] (03CR) 10Dreamy Jazz: [C:03+1] CheckUserGetUsersPager: Fix TypeError for numeric usernames [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305378 (https://phabricator.wikimedia.org/T429971) (owner: 10Kosta Harlan) [12:13:18] Mvolz: o/ did you get a task for the issue or were you checking the dashboard? [12:13:52] No it was a total coincidence, I was trying to deploy at the same time and noticed nothing was working [12:14:28] https://phabricator.wikimedia.org/T381372 re-ups the necessity for this [12:14:47] (03CR) 10Nikerabbit: [C:03+1] Enable ULS v2 by default across all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305290 (owner: 10Abijeet Patro) [12:15:17] I just stumbled upon https://phabricator.wikimedia.org/T316472 while checking the count for users without a gu_home_db and seeing it is 6909463 in prod...
cc Reedy zabe [12:16:01] Makes me very unsure about running that script in prod [12:16:07] claime: that it's still happening? [12:16:13] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [12:16:27] should probably re-run that query [12:16:48] Reedy: One it's still happening, two I'm deleting a wiki so I'm gonna set that field to '' for another 10752 [12:16:52] elukey: we still have a large increase in failures that started around 8:00 utc today [12:17:05] Those are still happening post url downloader [12:17:07] Reedy: I *just* checked the count in prod [12:17:08] two separate incidents? [12:17:09] Reedy: select COUNT(*) from globaluser where gu_home_db IS NULL OR gu_home_db = ""; [12:17:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2067,2072-2073,2114-2115,2124-2127,2256-2257].codfw.wmnet [12:17:13] Reedy: 6909463 [12:17:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:17:45] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-staging2001.codfw.wmnet [12:18:03] claime: I mean grouped by year etc [12:19:13] (03CR) 10Volans: [C:03+1] "LGTM, ship it!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [12:20:11] Reedy: 130k-180k per year starting 2023 [12:20:14] :) [12:20:55] Reedy: exact split https://phabricator.wikimedia.org/T418494#12049939 [12:21:09] (03CR) 10Elukey: [C:03+2] __init__: modify the management_password property (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304753 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [12:21:43] PROBLEM - Host gitlab-replica-b.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [12:22:07] PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:22:27] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:22:39] FIRING: CoreBGPDown: Core BGP session down between ssw1-a8-codfw and lsw1-a6-codfw (10.192.252.8) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a8-codfw:9804&var-bgp_group=EVPN_IBGP&var-bgp_neighbor=lsw1-a6-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:22:40] FIRING: [12x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:00] Reedy: I mean I can do the UPDATE to set to null and not reassign a home wiki as well [12:23:25] I need to move forwards with at least the localuser changes because it's breaking GlobalWatchlist [12:23:44] So I'm gonna consider my beta tests successful and do that [12:23:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/5 (Core: 
lsw1-a6-codfw:et-0/0/55 {#230403800020}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:23:53] !log cgoubert@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Testing apiportalwiki deletion in beta - T418494 (duration: 98m 50s) [12:23:57] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [12:26:24] mbsantos: that's gonna pull your and matmarex's patches [12:26:54] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1304845|Remove config related to the API Portal (T429372 T418494)]], [[gerrit:1305380|CommonSettings-labs: Remove api.wikimedia.beta.wmcloud.org (T429372 T418494)]] [12:27:00] T429372: Remove API Portal from WMF MediaWiki config - https://phabricator.wikimedia.org/T429372 [12:27:11] (03CR) 10Muehlenhoff: profile::reboot::unattended: add class to mark hosts for unattended reboots (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [12:27:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a6-codfw (10.192.252.8) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:27:40] FIRING: [16x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:02] !log cgoubert@deploy1003 apaskulin, cgoubert: Backport for [[gerrit:1304845|Remove config related to the API Portal (T429372 T418494)]], [[gerrit:1305380|CommonSettings-labs: Remove api.wikimedia.beta.wmcloud.org (T429372 T418494)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:29:07] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [12:29:41] FIRING: [5x] JobUnavailable: Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:31:10] !log cgoubert@deploy1003 apaskulin, cgoubert: Continuing with deployment [12:31:23] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host db1208.eqiad.wmnet [12:31:44] (03PS1) 10Anzx: csbwiki: update logo, wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304695 (https://phabricator.wikimedia.org/T429126) [12:31:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304695 (https://phabricator.wikimedia.org/T429126) (owner: 10Anzx) [12:33:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304001 (https://phabricator.wikimedia.org/T427917) (owner: 10Valn_ilyo) [12:33:07] RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:33:27] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:33:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/5 (Core: lsw1-a6-codfw:et-0/0/55 {#230403800020}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown 
[12:33:55] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-staging2001.codfw.wmnet [12:33:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-staging2001.codfw.wmnet [12:34:06] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [12:34:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [12:34:41] RESOLVED: [5x] JobUnavailable: Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:35:22] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2155: rack depool [12:35:28] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1304845|Remove config related to the API Portal (T429372 T418494)]], [[gerrit:1305380|CommonSettings-labs: Remove api.wikimedia.beta.wmcloud.org (T429372 T418494)]] (duration: 08m 34s) [12:35:35] T429372: Remove API Portal from WMF MediaWiki config - https://phabricator.wikimedia.org/T429372 [12:35:35] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [12:35:59] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2067,2072-2073,2114-2115,2124-2127,2256-2257].codfw.wmnet [12:36:06] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2067,2072-2073,2114-2115,2124-2127,2256-2257].codfw.wmnet [12:36:42] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2156: rack depool [12:36:50] marostegui: heads up that I'm going to do the localuser and localnames db changes for apiportalwiki deletion [12:36:55] RECOVERY - Host gitlab-replica-b.wikimedia.org is UP: PING OK - Packet loss
= 0%, RTA = 30.43 ms [12:37:37] claime: cool, I'll be around if you need me [12:37:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a6-codfw (10.192.252.8) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:37:40] RESOLVED: [16x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:37:45] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:37:45] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [12:37:52] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [12:38:05] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1238: Upgrading db1238.eqiad.wmnet [12:38:30] !log Deleting apiportalwiki references in localuser table - T418494 [12:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1238: Upgrading db1238.eqiad.wmnet [12:38:54] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:38:54] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893 [12:39:14] (03PS1) 10Gkyziridis: ml-services: Deploy the latest version of article-country model on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) [12:39:14] Haha of course mysql.php connects to a read-only replica [12:39:15] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2206: Upgrading db2206.codfw.wmnet [12:39:37] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2206: Upgrading db2206.codfw.wmnet [12:39:55] marostegui: Do I need to connect to a mariadb server directly or something? [12:39:56] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2155: rack depool [12:40:40] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp2043.codfw.wmnet [12:40:41] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2043.codfw.wmnet [12:40:47] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp2044.codfw.wmnet [12:40:48] (03CR) 10Zaidusyy: "I tried to submit this upstream to Google Gerrit, but my Google account is bugged and giving me a 403 Permission Denied error even after s" [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1305218 (https://phabricator.wikimedia.org/T429901) (owner: 10Zaidusyy) [12:40:48] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2044.codfw.wmnet [12:41:44] !log filippo@cumin1003 START - Cookbook sre.dns.netbox [12:41:45] cwilliams@cumin1003 major-upgrade (PID 2591934) is awaiting input [12:42:09] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049997 (10ayounsi) 05Open→03Resolved All done, and all services re-pooled. 
[12:42:30] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-06-18-181627 to 2026-06-23-135458 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305394 (https://phabricator.wikimedia.org/T416144) [12:42:33] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-06-17-182805 to 2026-06-23-115555 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305395 (https://phabricator.wikimedia.org/T416144) [12:42:37] cwilliams@cumin1003 major-upgrade (PID 2591995) is awaiting input [12:42:48] (03PS1) 10Jforrester: wikifunctions: Double memory for evaluators from 1G to 2G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305396 [12:42:49] Ah there's a --write option [12:42:50] ofc [12:42:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2155: repool after rack maintenance [12:44:33] !log Deleting apiportalwiki references in localnames table - T418494 [12:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:38] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [12:44:45] !log repooling cp2043 and cp2044 after reimage (T419825) [12:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:49] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [12:44:50] (03PS2) 10Gerrit maintenance bot: wmnet: Update x3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1296511 (https://phabricator.wikimedia.org/T427895) [12:45:07] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp2043.* [12:45:11] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp2044.* [12:45:20] (03CR) 10CWilliams: [C:03+2] wmnet: Update x3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1296511 (https://phabricator.wikimedia.org/T427895) (owner: 10Gerrit maintenance bot) [12:45:48] (03PS4) 10Btullis: presto: Test resource groups and spill features on the test cluster 
[puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) [12:45:48] (03PS4) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [12:46:18] (03CR) 10Fabfur: [C:03+2] hiera: disable awslc on esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/1305132 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [12:46:24] !log filippo@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for dumps-nfs - filippo@cumin1003" [12:46:26] !log cwilliams@dns1005 START - running authdns-update [12:46:29] !log filippo@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for dumps-nfs - filippo@cumin1003" [12:46:29] !log filippo@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:46:51] !log depooling cp3066 and cp3074 to reimage (T419825) [12:46:53] elukey@cumin1003 provision (PID 2592388) is awaiting input [12:46:55] !log Setting globaluser gu_home_db to NULL for apiportalwiki globalusers - T418494 [12:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:12] !log filippo@cumin1003 START - Cookbook sre.dns.netbox [12:47:19] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp3066.* [12:47:22] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp3066.* [12:47:26] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp3074.* [12:47:33] (03PS2) 10Ayounsi: tox: add python 3.14 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 [12:47:33] (03PS6) 10Ayounsi: netbox: add a BGP getter/setter [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1304554 [12:47:39] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:48:23] !log Deleting apiportalwiki references in GlobalUsage - T418494 [12:48:23] !log cwilliams@dns1005 END - running authdns-update [12:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:00] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp3066.esams.wmnet with OS trixie [12:49:02] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp3074.esams.wmnet with OS trixie [12:49:26] (03CR) 10Ayounsi: tox: add python 3.14 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 (owner: 10Ayounsi) [12:51:33] !log filippo@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for dumps-nfs - filippo@cumin1003" [12:51:37] !log filippo@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for dumps-nfs - filippo@cumin1003" [12:51:37] !log filippo@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:22] claime: some lag showing up in codfw [12:53:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:53:34] but it should recover soon, let me give it some downtime to avoid the p4ge [12:53:38] ack [12:54:25] claime: is it finished from your side? 
[12:54:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 9 hosts with reason: maintenance [12:54:34] yep I'm done with direct db edits [12:54:47] claime: oki, cool, I am still keeping an eye for the lag [12:54:59] should recover soon [12:55:23] (03CR) 10Reedy: "CC paladox" [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1305218 (https://phabricator.wikimedia.org/T429901) (owner: 10Zaidusyy) [12:55:36] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:56:53] (03PS1) 10Marostegui: production-m3.sql.erb: Remove old grant [puppet] - 10https://gerrit.wikimedia.org/r/1305399 (https://phabricator.wikimedia.org/T423727) [12:57:40] (03CR) 10Marostegui: [C:03+2] "This is a noop until removed across the dbs directly." [puppet] - 10https://gerrit.wikimedia.org/r/1305399 (https://phabricator.wikimedia.org/T423727) (owner: 10Marostegui) [12:57:42] (03CR) 10Marostegui: [V:03+2 C:03+2] production-m3.sql.erb: Remove old grant [puppet] - 10https://gerrit.wikimedia.org/r/1305399 (https://phabricator.wikimedia.org/T423727) (owner: 10Marostegui) [12:58:10] (03CR) 10Btullis: presto: Test resource groups and spill features on the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [12:59:01] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:59:17] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1300). 
[13:00:05] nemo-yiannis, kostajh, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] (03PS1) 10Filippo Giunchedi: conftool-data: add dumps-nfs [puppet] - 10https://gerrit.wikimedia.org/r/1305402 (https://phabricator.wikimedia.org/T411248) [13:00:15] o/ [13:00:16] 👋 [13:00:17] (03PS1) 10Filippo Giunchedi: dumps: open nfs port to lb healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1305403 (https://phabricator.wikimedia.org/T411248) [13:00:20] (03PS1) 10Filippo Giunchedi: hieradata: add dumps-nfs service in service_setup state [puppet] - 10https://gerrit.wikimedia.org/r/1305404 (https://phabricator.wikimedia.org/T411248) [13:00:23] (03PS1) 10Filippo Giunchedi: dumps: add dumps-nfs service pool [puppet] - 10https://gerrit.wikimedia.org/r/1305405 (https://phabricator.wikimedia.org/T411248) [13:00:37] hi [13:00:48] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1238.eqiad.wmnet with OS trixie [13:01:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:01:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305378 (https://phabricator.wikimedia.org/T429971) (owner: 10Kosta Harlan) [13:01:32] \o [13:01:59] (03PS1) 10Filippo Giunchedi: wikimedia.org: add dumps-nfs [dns] - 10https://gerrit.wikimedia.org/r/1305406 (https://phabricator.wikimedia.org/T411248) [13:02:22] (03PS1) 10Dreamy Jazz: Handle the ConfirmEditGetGlobalInstanceFromContext hook [extensions/ContactPage] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305407 (https://phabricator.wikimedia.org/T429848) [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on 
maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:33] (03PS1) 10Dreamy Jazz: Create ConfirmEditGetGlobalInstanceFromContext hook [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305408 (https://phabricator.wikimedia.org/T429848) [13:02:40] (03CR) 10CI reject: [V:04-1] Handle the ConfirmEditGetGlobalInstanceFromContext hook [extensions/ContactPage] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305407 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:02:54] (03CR) 10Dreamy Jazz: "recheck" [extensions/ContactPage] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305407 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:03:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ContactPage] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305407 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:03:19] claime: all fine [13:03:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305408 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:03:28] marostegui: ack, thanks [13:03:40] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12050131 (10elukey) @Jhancock.wm I was able to provision 2006 and 2007 with the new cookbook that I am testing, but I get connection timeouts to kafka-logging2008.s BMC. Is... 
[13:03:53] (03Merged) 10jenkins-bot: CheckUserGetUsersPager: Fix TypeError for numeric usernames [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305378 (https://phabricator.wikimedia.org/T429971) (owner: 10Kosta Harlan) [13:04:05] (03CR) 10Elukey: [C:03+1] tox: add python 3.14 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 (owner: 10Ayounsi) [13:04:29] scap is failing [13:04:33] https://spiderpig.wikimedia.org/jobs/2389 [13:04:46] 13:04:04 prep failed: Command 'git checkout --force -B master origin/master' failed with exit code 128 [13:04:48] kostajh: looking [13:04:55] claime: any ideas about this? [13:04:59] :) [13:05:00] cwilliams@cumin1003 major-upgrade (PID 2591995) is awaiting input [13:05:12] ty [13:05:42] kostajh: try again I think we just race-conditioned [13:05:56] ok, trying again [13:06:22] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1305378|CheckUserGetUsersPager: Fix TypeError for numeric usernames (T429971)]] [13:06:27] T429971: TypeError: CheckUserGetUsersPager::formatUserRow(): Argument #1 ($user_text) must be of type string, int given - https://phabricator.wikimedia.org/T429971 [13:06:50] seems to be working now, thanks [13:06:57] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2206.codfw.wmnet with OS trixie [13:07:11] (03CR) 10Kosta Harlan: [C:03+1] Create ConfirmEditGetGlobalInstanceFromContext hook [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305408 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:07:12] yeah I was testing something in mediawiki-staging and basically you ran scap just as I reset HEAD^ [13:07:31] Sorry about that :D [13:08:26] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1305378|CheckUserGetUsersPager: Fix TypeError for numeric usernames (T429971)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:11:23] !log kharlan@deploy1003 kharlan: Continuing with deployment [13:13:59] !log fabfur@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3074.esams.wmnet with reason: host reimage [13:14:55] !log fabfur@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [13:15:39] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305378|CheckUserGetUsersPager: Fix TypeError for numeric usernames (T429971)]] (duration: 09m 17s) [13:15:44] T429971: TypeError: CheckUserGetUsersPager::formatUserRow(): Argument #1 ($user_text) must be of type string, int given - https://phabricator.wikimedia.org/T429971 [13:15:53] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage [13:16:54] Who's next [13:17:19] i need someone to deploy mine [13:17:49] nemo-yiannis: What about yours? [13:18:02] i can deploy mine [13:18:14] Do you want to go then, as you are at the top of the list [13:18:17] ok [13:18:25] (03PS2) 10MSantos: Disable parser survey for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302201 [13:18:34] (03CR) 10Jgiannelos: [C:03+2] Disable parser survey for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302201 (owner: 10MSantos) [13:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:19:28] (03Merged) 10jenkins-bot: Disable parser survey for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302201 (owner: 10MSantos) [13:19:57] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3074.esams.wmnet with reason: host reimage [13:20:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC afternoon 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304919 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:21:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [13:21:08] ok merged, deploying mine [13:21:48] !log jgiannelos@deploy1003 Started scap sync-world: Backport for [[gerrit:1302201|Disable parser survey for all wikis]] [13:22:15] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2156: rack depool [13:23:52] !log jgiannelos@deploy1003 mbsantos, jgiannelos: Backport for [[gerrit:1302201|Disable parser survey for all wikis]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:24:42] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [13:26:15] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2206.codfw.wmnet with reason: host reimage [13:26:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12050252 (10ayounsi) [13:28:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2155: repool after rack maintenance [13:28:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage [13:29:04] (03PS5) 10Btullis: presto: Test resource groups and spill features on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) [13:29:04] (03PS5) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 
(https://phabricator.wikimedia.org/T424112) [13:29:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:30:46] 07Puppet, 06Release-Engineering-Team: registry-homepage-builder.py doesn't sort images as expected - https://phabricator.wikimedia.org/T388287#12050271 (10hashar) The pages got generated. I went to purge the example page: ` $ mwscript purgeList --wiki=aawiki https://docker-registry.wikimedia.org/releng/node22-... [13:30:56] 07Puppet, 06Release-Engineering-Team: registry-homepage-builder.py doesn't sort images as expected - https://phabricator.wikimedia.org/T388287#12050274 (10hashar) 05Open→03Resolved [13:31:51] 06SRE: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045 (10MoritzMuehlenhoff) 03NEW [13:32:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:32:36] 06SRE, 10hCaptcha, 06Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12050301 (10kostajh) [13:32:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2206.codfw.wmnet with reason: host reimage [13:33:05] (03CR) 10Ayounsi: [C:03+2] tox: add python 3.14 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 (owner: 10Ayounsi) [13:33:11] (03CR) 10Ayounsi: [C:03+2] netbox: add a BGP getter/setter [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304554 (owner: 10Ayounsi) [13:33:36] !log jgiannelos@deploy1003 mbsantos, jgiannelos: Continuing with deployment [13:33:38] (03CR) 10Ayounsi: [C:03+2] profile::server_depool: Mark ganeti/test as fine to ignore [puppet] - 10https://gerrit.wikimedia.org/r/1305387 (https://phabricator.wikimedia.org/T327300) (owner: 10Muehlenhoff) [13:33:54] (03CR) 
10Ayounsi: [C:03+2] Add depool policy for VTRS [puppet] - 10https://gerrit.wikimedia.org/r/1305350 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [13:36:52] 06SRE, 10hCaptcha, 06Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12050343 (10MLechvien-WMF) [13:37:07] (03PS1) 10Ayounsi: profile::server_depool for memcache and k8s master [puppet] - 10https://gerrit.wikimedia.org/r/1305422 (https://phabricator.wikimedia.org/T327300) [13:37:50] !log jgiannelos@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302201|Disable parser survey for all wikis]] (duration: 16m 01s) [13:37:52] (03CR) 10Ayounsi: [C:03+2] "self merging as noop" [puppet] - 10https://gerrit.wikimedia.org/r/1305422 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [13:37:55] (03Merged) 10jenkins-bot: tox: add python 3.14 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1305345 (owner: 10Ayounsi) [13:37:56] (03Merged) 10jenkins-bot: netbox: add a BGP getter/setter [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304554 (owner: 10Ayounsi) [13:38:05] okay done [13:40:02] (03PS21) 10Ayounsi: diffscan: pynotnify [puppet] - 10https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: 10Jbond) [13:40:13] (03CR) 10Ayounsi: [C:03+2] diffscan: pynotnify [puppet] - 10https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: 10Jbond) [13:41:14] (03CR) 10Ayounsi: [C:03+2] diffscan: pynotnify (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: 10Jbond) [13:41:16] Okay, up next [13:41:46] I'll take a look at the config patches [13:42:10] (03PS6) 10Btullis: presto: Test resource groups and spill features on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) [13:42:10] (03PS6) 10Btullis: presto: 
Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [13:45:12] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Migrate diffscan VM to Trixie - https://phabricator.wikimedia.org/T415347#12050399 (10ayounsi) Updated script merged, old instance powered down, the v4 public IP needs to be moved but that's not a blocker. Then I'll monitor for a few days. [13:45:31] (03PS4) 10Dreamy Jazz: hCaptcha: Enable for Special:Contact [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304919 (https://phabricator.wikimedia.org/T429848) [13:45:46] 06SRE, 10SRE-swift-storage, 07Essential-Work, 13Patch-For-Review: Migrate production swift clusters to trixie - https://phabricator.wikimedia.org/T429630#12050411 (10MatthewVernon) [13:45:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1238.eqiad.wmnet with OS trixie [13:46:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304001 (https://phabricator.wikimedia.org/T427917) (owner: 10Valn_ilyo) [13:46:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304695 (https://phabricator.wikimedia.org/T429126) (owner: 10Anzx) [13:46:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ContactPage] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305407 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:46:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304919 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:46:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap 
backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305408 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:46:52] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3074.esams.wmnet with OS trixie [13:47:31] (03Merged) 10jenkins-bot: Fix autonym for Khasi (kha) in wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304001 (https://phabricator.wikimedia.org/T427917) (owner: 10Valn_ilyo) [13:47:35] (03Merged) 10jenkins-bot: csbwiki: update logo, wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304695 (https://phabricator.wikimedia.org/T429126) (owner: 10Anzx) [13:47:38] (03Merged) 10jenkins-bot: hCaptcha: Enable for Special:Contact [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304919 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:48:28] (03PS5) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [13:49:08] (03PS7) 10Btullis: presto: Test resource groups and spill features on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) [13:49:08] (03PS7) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [13:49:41] (03Merged) 10jenkins-bot: Create ConfirmEditGetGlobalInstanceFromContext hook [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305408 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:49:43] (03Merged) 10jenkins-bot: Handle the ConfirmEditGetGlobalInstanceFromContext hook [extensions/ContactPage] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305407 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:50:17] !log dreamyjazz@deploy1003 Started 
scap sync-world: Backport for [[gerrit:1304001|Fix autonym for Khasi (kha) in wmgExtraLanguageNames (T427917)]], [[gerrit:1304695|csbwiki: update logo, wordmark and tagline (T429126)]], [[gerrit:1305407|Handle the ConfirmEditGetGlobalInstanceFromContext hook (T429848)]], [[gerrit:1304919|hCaptcha: Enable for Special:Contact (T429848)]], [[gerrit:1305408|Create ConfirmEditGetGlobalInstanc [13:50:17] eFromContext hook (T429848)]] [13:50:22] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2206.codfw.wmnet with OS trixie [13:50:25] T427917: Add monolingual language code kha (khasi language) - https://phabricator.wikimedia.org/T427917 [13:50:26] T429126: Change name of Kashubian Wikipedia from Wikipedijô to Wikipediô - https://phabricator.wikimedia.org/T429126 [13:50:26] T429848: hCaptcha: Use hCaptcha for contact pages on metawiki - https://phabricator.wikimedia.org/T429848 [13:50:26] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12050429 (10Jhancock.wm) @elukey i checked the cabling. reseated everything and rebooted the server. it still looked fine. I double checked the luggage tag and i had the wr... 
[13:51:15] (03CR) 10Btullis: presto: Test resource groups and spill features on the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [13:51:24] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3066.esams.wmnet with OS trixie [13:51:47] 06SRE, 10Citoid: citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430053 (10MoritzMuehlenhoff) 03NEW [13:52:24] !log dreamyjazz@deploy1003 dreamyjazz, valn-ilyo, anzx: Backport for [[gerrit:1304001|Fix autonym for Khasi (kha) in wmgExtraLanguageNames (T427917)]], [[gerrit:1304695|csbwiki: update logo, wordmark and tagline (T429126)]], [[gerrit:1305407|Handle the ConfirmEditGetGlobalInstanceFromContext hook (T429848)]], [[gerrit:1304919|hCaptcha: Enable for Special:Contact (T429848)]], [[gerrit:1305408|Create ConfirmEditGetGlobalIns [13:52:24] tanceFromContext hook (T429848)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:52:39] checking [13:52:42] Thanks [13:53:29] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [13:53:39] (03CR) 10Jelto: [V:03+1] profile::base::reboot_unattended: add class to mark hosts for unattended reboots (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [13:53:43] Dreamy_Jazz: looks good, ok to sync [13:54:41] Thanks, I'm still testing mine [13:55:10] !log dreamyjazz@deploy1003 dreamyjazz, valn-ilyo, anzx: Continuing with deployment [13:55:30] (03PS1) 10Dreamy Jazz: Handle the ConfirmEditGetGlobalInstanceFromContext hook [extensions/ContactPage] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305430 (https://phabricator.wikimedia.org/T429848) [13:55:43] (03PS1) 10Dreamy Jazz: Create ConfirmEditGetGlobalInstanceFromContext hook [extensions/ConfirmEdit] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305431 (https://phabricator.wikimedia.org/T429848) [13:55:50] (03CR) 10Muehlenhoff: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [13:55:50] (03CR) 10CI reject: [V:04-1] Handle the ConfirmEditGetGlobalInstanceFromContext hook [extensions/ContactPage] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305430 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:56:21] (03CR) 10Dreamy Jazz: "recheck" [extensions/ContactPage] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305430 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [13:56:28] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [13:56:36] (03CR) 10Btullis: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [13:57:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:59:30] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1304001|Fix autonym for Khasi (kha) in wmgExtraLanguageNames (T427917)]], [[gerrit:1304695|csbwiki: update logo, wordmark and tagline (T429126)]], [[gerrit:1305407|Handle the ConfirmEditGetGlobalInstanceFromContext hook (T429848)]], [[gerrit:1304919|hCaptcha: Enable for Special:Contact (T429848)]], [[gerrit:1305408|Create ConfirmEditGetGlobalInstan [13:59:30] ceFromContext hook (T429848)]] (duration: 09m 13s) [13:59:38] T427917: Add monolingual language code kha (khasi language) - https://phabricator.wikimedia.org/T427917 [13:59:39] T429126: Change name of Kashubian Wikipedia from Wikipedijô to Wikipediô - https://phabricator.wikimedia.org/T429126 [13:59:39] T429848: hCaptcha: Use hCaptcha for contact pages on metawiki - https://phabricator.wikimedia.org/T429848 [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1400) [14:00:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ContactPage] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305430 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [14:00:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305431 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [14:01:36] !log 
cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1238: Migration of db1238.eqiad.wmnet completed [14:01:45] Dreamy_Jazz: Don't worry, we're not using the MW side of the deployment window. [14:02:27] Thanks, needed to backport to wmf.7 but only realised during the test stage :D [14:02:42] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-06-18-181627 to 2026-06-23-135458 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305394 (https://phabricator.wikimedia.org/T416144) (owner: 10Jforrester) [14:02:50] Always the way! [14:03:02] Maybe we need the backport windows to be 24 hours long, then there is never a worry about going over :D [14:03:22] (03PS3) 10Btullis: Remove the job that synced the phab dumps to the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1245419 (https://phabricator.wikimedia.org/T417824) [14:04:12] Dreamy_Jazz: Thanks for deploying [14:04:16] Np [14:04:35] (03PS13) 10Btullis: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [14:04:58] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-06-18-181627 to 2026-06-23-135458 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305394 (https://phabricator.wikimedia.org/T416144) (owner: 10Jforrester) [14:05:01] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp3066.esams.wmnet [14:05:01] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3066.esams.wmnet [14:05:08] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp3074.esams.wmnet [14:05:09] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3074.esams.wmnet [14:05:28] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2206: Migration of db2206.codfw.wmnet completed [14:06:35] !log apine@deploy1003 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [14:07:30] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:08:41] (03Merged) 10jenkins-bot: Create ConfirmEditGetGlobalInstanceFromContext hook [extensions/ConfirmEdit] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305431 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [14:08:43] (03Merged) 10jenkins-bot: Handle the ConfirmEditGetGlobalInstanceFromContext hook [extensions/ContactPage] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305430 (https://phabricator.wikimedia.org/T429848) (owner: 10Dreamy Jazz) [14:08:59] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:09:13] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1305430|Handle the ConfirmEditGetGlobalInstanceFromContext hook (T429848)]], [[gerrit:1305431|Create ConfirmEditGetGlobalInstanceFromContext hook (T429848)]] [14:09:18] T429848: hCaptcha: Use hCaptcha for contact pages on metawiki - https://phabricator.wikimedia.org/T429848 [14:09:35] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:09:42] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:10:06] (03PS1) 10Kamila Součková: shellbox: pick up new images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305432 (https://phabricator.wikimedia.org/T385404) [14:10:10] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:10:15] (03CR) 10CI reject: [V:04-1] shellbox: pick up new images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305432 (https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [14:11:10] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-06-17-182805 to 2026-06-23-115555 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305395 (https://phabricator.wikimedia.org/T416144) (owner: 
10Jforrester) [14:11:18] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1305430|Handle the ConfirmEditGetGlobalInstanceFromContext hook (T429848)]], [[gerrit:1305431|Create ConfirmEditGetGlobalInstanceFromContext hook (T429848)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:12:08] (03PS2) 10Kamila Součková: shellbox: pick up new images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305432 (https://phabricator.wikimedia.org/T385404) [14:12:47] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [14:13:23] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-06-17-182805 to 2026-06-23-115555 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305395 (https://phabricator.wikimedia.org/T416144) (owner: 10Jforrester) [14:14:01] (03Abandoned) 10Bernard Wang: Restore menu tab underline style [skins/Vector] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305191 (https://phabricator.wikimedia.org/T428519) (owner: 10Jdlrobson) [14:14:23] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:14:28] PROBLEM - Host db1208 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:41] (03Restored) 10Bernard Wang: Restore menu tab underline style [skins/Vector] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305191 (https://phabricator.wikimedia.org/T428519) (owner: 10Jdlrobson) [14:15:00] (03CR) 10Bernard Wang: [C:03+1] "sorry! 
didn’t realize this was a backport patch" [skins/Vector] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305191 (https://phabricator.wikimedia.org/T428519) (owner: 10Jdlrobson) [14:15:04] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:15:32] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:15:56] RECOVERY - Host db1208 is UP: PING OK - Packet loss = 0%, RTA = 7.76 ms [14:16:00] PROBLEM - MariaDB read only matomo on db1208 is CRITICAL: Could not connect to localhost:3351 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:16:10] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:16:10] RECOVERY - MariaDB read only analytics_meta on db1208 is OK: Version 10.6.18-MariaDB-log, Uptime 21s, read_only: True, event_scheduler: True, 2582.64 QPS, connection latency: 0.041477s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:16:17] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:16:34] RECOVERY - MariaDB disk space on db1208 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:16:34] RECOVERY - MariaDB Replica IO: analytics_meta on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:16:34] PROBLEM - MariaDB Replica Lag: matomo on db1208 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:16:34] PROBLEM - mysqld processes on db1208 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:16:34] PROBLEM - MariaDB Replica SQL: matomo on db1208 is CRITICAL: CRITICAL slave_sql_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:16:35] PROBLEM - MariaDB Replica IO: matomo on db1208 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:16:36] RECOVERY - MariaDB Replica SQL: analytics_meta on db1208 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:16:36] PROBLEM - MariaDB Replica Lag: analytics_meta on db1208 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 42341.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:16:59] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:17:01] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305430|Handle the ConfirmEditGetGlobalInstanceFromContext hook (T429848)]], [[gerrit:1305431|Create ConfirmEditGetGlobalInstanceFromContext hook (T429848)]] (duration: 07m 48s) [14:17:06] T429848: hCaptcha: Use hCaptcha for contact pages on metawiki - https://phabricator.wikimedia.org/T429848 [14:17:23] (03PS1) 10Slyngshede: P:cache::haproxy image provenance hashing [puppet] - 10https://gerrit.wikimedia.org/r/1305433 (https://phabricator.wikimedia.org/T414338) [14:17:30] !log Afternoon UTC backport window done [14:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:34] RECOVERY - MariaDB Replica Lag: analytics_meta on db1208 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:19:25] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Double memory for evaluators from 1G to 2G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305396 (owner: 10Jforrester) [14:19:42] (03PS2) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 
(https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [14:20:16] (03PS2) 10Klausman: hiera: Switch ml-staging k8s to Maglev LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1305397 (https://phabricator.wikimedia.org/T420438) [14:21:42] (03Merged) 10jenkins-bot: wikifunctions: Double memory for evaluators from 1G to 2G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305396 (owner: 10Jforrester) [14:22:34] (03PS6) 10Jelto: profile::base: add parameter to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [14:23:03] 06SRE, 06Infrastructure-Foundations: Adding Jesse to approvers for Bitu - https://phabricator.wikimedia.org/T430059 (10MoritzMuehlenhoff) 03NEW [14:23:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:24:45] (03PS1) 10Eric Gardner: Restore the per-reader opt-out for the mobile image carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305436 (https://phabricator.wikimedia.org/T419786) [14:24:55] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:24:56] (03PS1) 10Eric Gardner: Restore the per-reader opt-out for the mobile image carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305437 (https://phabricator.wikimedia.org/T419786) [14:24:59] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:25:22] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T429116#12050708 (10Jclark-ctr) 05Open→03Resolved graph looks to 
have cleaned up since replacement of cable and optic. Closing ticket for now {F90294556} [14:25:23] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:25:27] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:25:37] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:25:42] James_F: we have another urgent thing, once again coinciding with wikifunctions services deployment slot; would it be fine to do another emergency scap once again? [14:25:52] !log repooling cp3066 and cp3074 after reimage (T419825) [14:25:53] matthiasmullie: go for it. [14:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:57] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [14:26:00] I came here to ask the same thing [14:26:00] We're just doing a last services push. [14:26:04] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp3066.* [14:26:07] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp3066.* [14:26:07] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:26:12] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp3074.* [14:26:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:26:14] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:26:21] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp3066.esams.wmnet [14:26:22] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3066.esams.wmnet [14:26:28] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp3074.esams.wmnet [14:26:29] !log 
fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3074.esams.wmnet [14:26:49] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:27:01] James_F: thanks! [14:27:42] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8771/co" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [14:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:29:37] (03CR) 10Jforrester: "Yay." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305432 (https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1430) [14:31:09] (03CR) 10Jelto: [V:03+1] profile::base: add parameter to mark hosts for unattended reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [14:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:32:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
mfossati@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305436 (https://phabricator.wikimedia.org/T419786) (owner: 10Eric Gardner) [14:32:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305437 (https://phabricator.wikimedia.org/T419786) (owner: 10Eric Gardner) [14:34:01] (03Merged) 10jenkins-bot: Restore the per-reader opt-out for the mobile image carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1305436 (https://phabricator.wikimedia.org/T419786) (owner: 10Eric Gardner) [14:34:02] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [14:34:03] (03Merged) 10jenkins-bot: Restore the per-reader opt-out for the mobile image carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305437 (https://phabricator.wikimedia.org/T419786) (owner: 10Eric Gardner) [14:34:31] !log mfossati@deploy1003 Started scap sync-world: Backport for [[gerrit:1305436|Restore the per-reader opt-out for the mobile image carousel (T419786)]], [[gerrit:1305437|Restore the per-reader opt-out for the mobile image carousel (T419786)]] [14:34:37] T419786: Image Browsing: Opt-out mechanisms - https://phabricator.wikimedia.org/T419786 [14:41:53] (03CR) 10Muehlenhoff: "Looks good, one additional comment inline. 
And let's doublecheck with a PCC run against" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [14:46:10] !log installing postgresql security updates [14:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1238: Migration of db1238.eqiad.wmnet completed [14:47:08] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:50:23] (03CR) 10Scott French: [C:03+1] shellbox: pick up new images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305432 (https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [14:50:59] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2206: Migration of db2206.codfw.wmnet completed [14:51:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:51:18] (03CR) 10Btullis: [C:03+2] Remove the job that synced the phab dumps to the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1245419 (https://phabricator.wikimedia.org/T417824) (owner: 10Btullis) [14:53:39] !log mfossati@deploy1003 egardner, mfossati: Backport for [[gerrit:1305436|Restore the per-reader opt-out for the mobile image carousel (T419786)]], [[gerrit:1305437|Restore the per-reader opt-out for the mobile image carousel (T419786)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:53:43] T419786: Image Browsing: Opt-out mechanisms - https://phabricator.wikimedia.org/T419786 [14:53:58] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM! Thank you." [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:54:09] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM! Thank you." 
[puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:54:45] (03PS14) 10Btullis: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [14:54:50] (03CR) 10Atsuko: [C:03+1] "+1!" [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:54:54] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12050803 (10MoritzMuehlenhoff) [14:55:16] !og installing libarchive security updates [14:57:19] !log mfossati@deploy1003 egardner, mfossati: Continuing with deployment [14:58:29] (03CR) 10Btullis: [C:03+2] presto: Test resource groups and spill features on the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1305108 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [15:01:17] 06SRE, 06Infrastructure-Foundations: Adding Jesse to approvers for Bitu - https://phabricator.wikimedia.org/T430059#12050822 (10LSobanski) Approved. [15:02:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12050824 (10Jclark-ctr) a:03Andrew Is there a ticket tracking the rebalancing of CloudVirts for power to make room? Has any decision been made regarding... [15:02:35] (03CR) 10Scott French: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1305358 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [15:03:01] (03CR) 10Kamila Součková: [C:03+2] shellbox: pick up new images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305432 (https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [15:03:52] jouncebot: nowandnext [15:03:52] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [15:03:52] In 1 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1700) [15:05:37] (03Merged) 10jenkins-bot: shellbox: pick up new images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305432 (https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [15:07:49] (03PS1) 10Clément Goubert: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305442 (https://phabricator.wikimedia.org/T429372) [15:08:07] (03PS1) 10Muehlenhoff: Add Jesse to Bitu approvers [puppet] - 10https://gerrit.wikimedia.org/r/1305443 (https://phabricator.wikimedia.org/T430059) [15:08:23] (03PS1) 10Jdlrobson: Replace Tools button with vertical ellipsis [skins/Vector] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305444 (https://phabricator.wikimedia.org/T429258) [15:09:19] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [15:09:57] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:10:03] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [15:10:23] !log mfossati@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305436|Restore the per-reader opt-out for the mobile image carousel (T419786)]], [[gerrit:1305437|Restore the per-reader opt-out for the mobile image carousel (T419786)]] (duration: 35m 52s) [15:10:26] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [15:10:28] 
T419786: Image Browsing: Opt-out mechanisms - https://phabricator.wikimedia.org/T419786 [15:10:32] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [15:10:57] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [15:11:03] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:11:22] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:11:28] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [15:11:57] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [15:12:03] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [15:12:30] (03CR) 10Zabe: [C:03+1] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305442 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [15:12:49] (03PS3) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [15:12:51] (03CR) 10Jforrester: [C:03+1] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305442 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [15:14:07] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [15:14:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305442 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [15:14:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [15:15:15] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply 
[15:15:37] (03Merged) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305442 (https://phabricator.wikimedia.org/T429372) (owner: 10Clément Goubert) [15:16:06] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1305442|Update interwiki map (T429372 T418494)]] [15:16:13] T429372: Remove API Portal from WMF MediaWiki config - https://phabricator.wikimedia.org/T429372 [15:16:13] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [15:20:36] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1305442|Update interwiki map (T429372 T418494)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:21:12] !log cgoubert@deploy1003 cgoubert: Continuing with deployment [15:23:23] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305289 (https://phabricator.wikimedia.org/T428984) [15:25:02] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305288 (https://phabricator.wikimedia.org/T428984) [15:25:49] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [15:25:55] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [15:26:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2003 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:26:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:26:55] (03PS4) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [15:27:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [15:28:46] (03PS1) 10Brouberol: global_config: register phabricator in the external-services [puppet] - 10https://gerrit.wikimedia.org/r/1305449 (https://phabricator.wikimedia.org/T430024) [15:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:29:50] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [15:29:56] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [15:30:15] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305442|Update interwiki map (T429372 T418494)]] (duration: 14m 09s) [15:30:21] T429372: Remove API Portal from WMF MediaWiki config - https://phabricator.wikimedia.org/T429372 [15:30:22] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [15:30:35] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [15:30:41] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:31:50] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:31:56] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply 
[15:32:30] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [15:32:36] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [15:33:36] Dreamy_Jazz: Any way the backports fro ConfirmEditGetGlobalInstanceFromContext you did earlier would vause worker usage to juump 25%? [15:33:42] s/vause/cause/ [15:33:47] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [15:33:55] Dreamy_Jazz: https://grafana.wikimedia.org/goto/efq4525qr55vke?orgId=1 [15:35:48] (03CR) 10Bking: [C:03+1] global_config: register phabricator in the external-services [puppet] - 10https://gerrit.wikimedia.org/r/1305449 (https://phabricator.wikimedia.org/T430024) (owner: 10Brouberol) [15:36:10] I wouldn't have thought that could cause the jump [15:36:28] The changes should only apply to Special:Contact on meta.wikimedia.org [15:36:41] Dreamy_Jazz: ok looking for another cause then [15:36:47] (03PS1) 10Kamila Součková: admin: increase shellbox CPU limit quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305457 (https://phabricator.wikimedia.org/T385404) [15:36:57] (03CR) 10AikoChou: ml-services: Deploy artest version of ticle-country model on staging. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305382 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [15:37:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:37:29] (03CR) 10Santiago Faci: [C:03+2] Remove saved groups config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305287 (https://phabricator.wikimedia.org/T429959) (owner: 10Clare Ming) [15:38:20] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:38:20] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [15:38:40] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1241: Upgrading db1241.eqiad.wmnet [15:38:40] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:38:40] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893 [15:39:00] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2210: Upgrading db2210.codfw.wmnet [15:39:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1241: Upgrading db1241.eqiad.wmnet [15:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:39:28] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:39:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2210: Upgrading db2210.codfw.wmnet [15:39:45] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:39:53] (03Merged) 10jenkins-bot: Remove saved groups config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305287 (https://phabricator.wikimedia.org/T429959) (owner: 10Clare Ming) [15:40:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:28] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [15:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:41:22] Yeah on it [15:41:26] !incidents [15:41:26] 8096 (UNACKED) PHPFPMTooBusy sre (mw-api-ext main eqiad) [15:41:29] !ack 8096 [15:41:29] 8096 (ACKED) PHPFPMTooBusy sre (mw-api-ext main eqiad) [15:42:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:42:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:43:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:43:23] !log elukey@cumin1003 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:43:41] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1241.eqiad.wmnet with OS trixie [15:44:04] * Raine is slightly suspicious of DB upgrades [15:44:13] cwilliams@cumin1003 major-upgrade (PID 2620532) is awaiting input [15:44:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 1.543s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:45:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:45:24] * Raine is also slightly suspicious of dumps that needed to be restarted because bookworm made them unhappy [15:45:35] (03PS1) 10Brouberol: phabricator: enable egress from the dse kubepods networks [puppet] - 10https://gerrit.wikimedia.org/r/1305460 (https://phabricator.wikimedia.org/T430024) [15:46:43] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [15:46:49] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:47:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:47:31] * Raine actually asked and dumps shouldn't be using mw-api-ext [15:48:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [15:49:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.09s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:49:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:24] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:49:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 5.926% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:50:53] (03CR) 10Santiago Faci: Test Kitchen UI: Deploy v1.4.5 release to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305288 (https://phabricator.wikimedia.org/T428984) (owner: 10Clare Ming) [15:50:58] (03CR) 10Santiago Faci: Test Kitchen UI: Deploy v1.4.5 release to production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305289 (https://phabricator.wikimedia.org/T428984) (owner: 10Clare Ming) [15:51:12] Raine: I am not seeing much in the way of errors relating to s4 - still concerned? 
[15:51:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:51:34] cezmunsta: no, resolved we think, thank you! [15:52:06] * cezmunsta sighs with relief [15:52:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:52:16] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy artest version of ticle-country model on staging. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305382 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [15:52:16] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:52:23] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:52:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12051057 (10fgiunchedi) Yes the cloudvirt rebalancing task is {T424658} and the last table in the description lists the host moves to get to balance. I'll... 
[15:52:50] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:52:56] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [15:52:56] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2210.codfw.wmnet with OS trixie [15:53:02] (03PS8) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [15:53:02] (03PS1) 10Btullis: presto: update the properties for spilling [puppet] - 10https://gerrit.wikimedia.org/r/1305462 (https://phabricator.wikimedia.org/T424112) [15:53:16] (03PS2) 10Btullis: presto: update the properties for spilling [puppet] - 10https://gerrit.wikimedia.org/r/1305462 (https://phabricator.wikimedia.org/T424112) [15:53:23] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [15:53:25] (03PS9) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [15:53:30] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:54:31] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:56:31] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:48] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12051106 (10elukey) All three provisioned, I'll reimage them tomorrow. Note that they need a new version of reimage as well. 
[15:58:49] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage [16:01:56] (03PS1) 10Btullis: Temporarily remove dse-k8s-worker101[567] from service [puppet] - 10https://gerrit.wikimedia.org/r/1305467 (https://phabricator.wikimedia.org/T429773) [16:02:29] (03CR) 10CI reject: [V:04-1] Temporarily remove dse-k8s-worker101[567] from service [puppet] - 10https://gerrit.wikimedia.org/r/1305467 (https://phabricator.wikimedia.org/T429773) (owner: 10Btullis) [16:04:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage [16:05:00] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#12051135 (10GPSLeo) >>! In T427949#12039056, @jcrespo wrote: > My suggestion was to build a dedicated, but separate, repository for professional g... [16:06:27] (03PS2) 10Btullis: Temporarily remove dse-k8s-worker101[567] from service [puppet] - 10https://gerrit.wikimedia.org/r/1305467 (https://phabricator.wikimedia.org/T429773) [16:06:57] (03PS7) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [16:08:41] (03PS8) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [16:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:32] (03CR) 10Btullis: [C:03+2] presto: update the properties for spilling [puppet] - 10https://gerrit.wikimedia.org/r/1305462 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [16:10:59] 
(03CR) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [16:12:09] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2210.codfw.wmnet with reason: host reimage [16:12:36] RECOVERY - mysqld processes on db1208 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:13:02] RECOVERY - MariaDB read only matomo on db1208 is OK: Version 10.6.18-MariaDB-log, Uptime 27s, read_only: True, event_scheduler: True, 11.23 QPS, connection latency: 0.041757s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:13:36] RECOVERY - MariaDB Replica SQL: matomo on db1208 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [16:14:36] RECOVERY - MariaDB Replica IO: matomo on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [16:14:36] PROBLEM - MariaDB Replica Lag: matomo on db1208 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 18724.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [16:14:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:46] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [16:15:37] RECOVERY - MariaDB Replica Lag: matomo on db1208 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [16:17:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:18:53] (03PS9) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [16:19:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2210.codfw.wmnet with reason: host reimage [16:21:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1241.eqiad.wmnet with OS trixie [16:23:01] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [16:23:19] (03PS10) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [16:26:29] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [16:30:43] (03CR) 10Scott French: [C:03+1] admin: increase shellbox CPU limit quota (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305457 
(https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [16:33:33] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1241: Migration of db1241.eqiad.wmnet completed [16:37:10] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:37:14] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2210.codfw.wmnet with OS trixie [16:37:21] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:41:36] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:41:45] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:45:15] (03PS3) 10Btullis: Temporarily remove dse-k8s-worker101[567] from service [puppet] - 10https://gerrit.wikimedia.org/r/1305467 (https://phabricator.wikimedia.org/T429773) [16:51:02] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2210: Migration of db2210.codfw.wmnet completed [16:56:08] (03PS4) 10Btullis: Temporarily remove dse-k8s-worker101[567] from service [puppet] - 10https://gerrit.wikimedia.org/r/1305467 (https://phabricator.wikimedia.org/T429773) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1700) [17:00:21] (03PS3) 10Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305288 (https://phabricator.wikimedia.org/T428984) [17:01:39] (03PS3) 10Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305289 (https://phabricator.wikimedia.org/T428984) [17:01:54] (03CR) 10Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to staging 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305288 (https://phabricator.wikimedia.org/T428984) (owner: 10Clare Ming) [17:02:04] (03CR) 10Clare Ming: Test Kitchen UI: Deploy v1.4.5 release to production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305289 (https://phabricator.wikimedia.org/T428984) (owner: 10Clare Ming) [17:02:49] (03PS2) 10Pushpaktiwari: T429269: Send logged-in experiment events to ins-502b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303490 [17:05:53] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [17:06:10] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [17:06:13] (03CR) 10Pushpaktiwari: T429269: Send logged-in experiment events to ins-502b (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303490 (owner: 10Pushpaktiwari) [17:15:05] (03CR) 10Bking: [C:03+1] Temporarily remove dse-k8s-worker101[567] from service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1305467 (https://phabricator.wikimedia.org/T429773) (owner: 10Btullis) [17:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:19:00] (03PS5) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [17:19:04] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1241: Migration of db1241.eqiad.wmnet completed [17:19:05] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [17:19:35] (03CR) 10CI reject: [V:04-1] opensearch: split plugins_mandatory into 
own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [17:21:03] (03PS1) 10CDanis: turnilo: drop kind:number from X-Is-Browser dim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305478 [17:23:59] (03CR) 10Kamila Součková: [C:03+2] admin: increase shellbox CPU limit quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305457 (https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [17:25:27] (03PS6) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [17:26:20] (03CR) 10Btullis: [C:03+2] Temporarily remove dse-k8s-worker101[567] from service [puppet] - 10https://gerrit.wikimedia.org/r/1305467 (https://phabricator.wikimedia.org/T429773) (owner: 10Btullis) [17:28:34] (03CR) 10Scott French: [C:03+1] turnilo: drop kind:number from X-Is-Browser dim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305478 (owner: 10CDanis) [17:28:37] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T430072 [17:29:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [17:32:25] (03Merged) 10jenkins-bot: admin: increase shellbox CPU limit quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305457 (https://phabricator.wikimedia.org/T385404) (owner: 10Kamila Součková) [17:34:14] !log kamila@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:34:51] !log kamila@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:35:04] !log kamila@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:35:39] !log kamila@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[17:36:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2210: Migration of db2210.codfw.wmnet completed [17:36:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [17:36:45] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:38:15] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T430072 [17:38:22] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:41:34] (03PS10) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [17:41:34] (03PS1) 10Btullis: presto: Fix the resource-groups configuration [puppet] - 10https://gerrit.wikimedia.org/r/1305481 (https://phabricator.wikimedia.org/T424112) [17:42:20] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T430072 [17:44:19] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Add jobrunner to beta mw_web_clusters list [puppet] - 10https://gerrit.wikimedia.org/r/1305483 (https://phabricator.wikimedia.org/T430075) [17:44:54] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1305481 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [17:45:17] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:45:47] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:45:48] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:46:09] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1016.eqiad.wmnet [17:46:09] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host 
dse-k8s-worker1016.eqiad.wmnet [17:46:16] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:47:54] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305481 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [17:50:34] (03CR) 10Btullis: [C:03+2] presto: Fix the resource-groups configuration [puppet] - 10https://gerrit.wikimedia.org/r/1305481 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [17:51:13] (03CR) 10CDanis: [C:03+2] turnilo: drop kind:number from X-Is-Browser dim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305478 (owner: 10CDanis) [17:51:49] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T430072 [17:51:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:52:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [17:52:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm [17:52:55] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [17:53:20] (03Merged) 10jenkins-bot: turnilo: drop kind:number from X-Is-Browser dim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305478 (owner: 10CDanis) [17:54:11] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [17:54:35] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [17:55:36] (03PS7) 10Bking: opensearch: split 
plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [17:55:38] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [17:55:50] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [17:59:19] FIRING: [2x] HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:00:03] (03PS8) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [18:00:04] brennen and jeena: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T1800). 
[18:00:11] o/ [18:00:36] (03CR) 10CI reject: [V:04-1] opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [18:01:51] !log 1.47.0-wmf.8 train status (T423917): no current blockers, logs no worse than expected, rolling to group1 [18:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:56] T423917: 1.47.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T423917 [18:03:32] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305484 (https://phabricator.wikimedia.org/T423917) [18:03:35] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305484 (https://phabricator.wikimedia.org/T423917) (owner: 10TrainBranchBot) [18:03:41] (03PS3) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [18:04:32] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305484 (https://phabricator.wikimedia.org/T423917) (owner: 10TrainBranchBot) [18:05:14] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [18:05:14] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [18:05:21] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [18:05:34] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1242: Upgrading db1242.eqiad.wmnet [18:06:05] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1242: Upgrading db1242.eqiad.wmnet [18:08:38] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage [18:09:08] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime 
for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage [18:10:22] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage [18:12:44] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.8 refs T423917 [18:12:45] cwilliams@cumin1003 major-upgrade (PID 2641087) is awaiting input [18:12:48] T423917: 1.47.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T423917 [18:14:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage [18:16:00] !log kamila@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:16:55] 06SRE, 10hCaptcha, 06Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12051694 (10Scott_French) FWIW, it does not look like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1303341 was ever applied - i.e., th... [18:17:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage [18:19:24] !log kamila@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[18:19:50] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [18:21:54] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [18:21:55] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [18:22:01] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [18:22:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1242: Upgrading db1242.eqiad.wmnet [18:22:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage [18:22:13] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1242: Upgrading db1242.eqiad.wmnet [18:22:43] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1242.eqiad.wmnet with OS trixie [18:23:09] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [18:23:09] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893 [18:23:30] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2219: Upgrading db2219.codfw.wmnet [18:23:40] (03CR) 10Ahmon Dancy: "Beta-only change. Already live and working properly in beta." 
[puppet] - 10https://gerrit.wikimedia.org/r/1305483 (https://phabricator.wikimedia.org/T430075) (owner: 10Ahmon Dancy) [18:23:52] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2219: Upgrading db2219.codfw.wmnet [18:25:21] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2219.codfw.wmnet with OS trixie [18:25:27] (03PS11) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [18:25:27] (03PS1) 10Btullis: presto: Fix up the values for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305486 (https://phabricator.wikimedia.org/T424112) [18:25:37] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305486 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [18:29:21] (03CR) 10Btullis: [C:03+2] presto: Fix up the values for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305486 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [18:31:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [18:35:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [18:35:28] (03PS4) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [18:39:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm [18:40:11] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1242.eqiad.wmnet with reason: host reimage [18:40:19] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[18:43:01] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:43:45] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:44:04] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2219.codfw.wmnet with reason: host reimage [18:44:21] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1242.eqiad.wmnet with reason: host reimage [18:45:37] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:46:46] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [18:48:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12051798 (10ayounsi) {T424871} are the switch replacement tracking task to support 25G. But even after they arrive some time will be needed to configure/au... [18:48:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2219.codfw.wmnet with reason: host reimage [18:49:28] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:52:09] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Add jobrunner to beta mw_web_clusters list [puppet] - 10https://gerrit.wikimedia.org/r/1305483 (https://phabricator.wikimedia.org/T430075) (owner: 10Ahmon Dancy) [18:52:26] (03PS1) 10Dzahn: jenkins: configure upstream_host: "localhost" for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1305488 (https://phabricator.wikimedia.org/T418521) [18:53:25] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: new CNAME record for WikiLearn - https://phabricator.wikimedia.org/T429628#12051808 (10BCornwall) 05Open→03Resolved I'm marking this as resolved: Please feel free to reopen if this hasn't been! [18:56:17] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[18:58:35] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [19:01:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1242.eqiad.wmnet with OS trixie [19:01:58] !log applied latent admin_ng diffs for mw-pretrain - T427668 [19:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:02] T427668: Turn up the Pretrain MVP environment - https://phabricator.wikimedia.org/T427668 [19:02:18] !log applied latent admin_ng diffs for allow-urldownloaders GlobalNetworkPolicy - T430045 T427282 [19:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:25] T430045: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045 [19:02:25] T427282: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282 [19:03:55] (03PS2) 10Dzahn: jenkins: configure upstream_addr as localhost for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1305488 (https://phabricator.wikimedia.org/T418521) [19:05:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2219.codfw.wmnet with OS trixie [19:06:48] (03PS3) 10Dzahn: jenkins: configure upstream_addr as localhost for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1305488 (https://phabricator.wikimedia.org/T418521) [19:08:11] (03PS9) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [19:08:40] (03PS1) 10Jgreen: Remove deprecated civi.wm.o and civi.frdev.wm.o CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1305490 [19:13:17] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [19:14:04] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [19:14:14] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply 
[19:15:02] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1305488/8778/contint1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1305488 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:15:13] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1242: Migration of db1242.eqiad.wmnet completed [19:15:14] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [19:16:13] (03CR) 10Dwisehaupt: [C:03+2] Remove deprecated civi.wm.o and civi.frdev.wm.o CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1305490 (owner: 10Jgreen) [19:17:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [19:17:45] (03CR) 10Jgreen: [C:03+2] Remove deprecated civi.wm.o and civi.frdev.wm.o CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1305490 (owner: 10Jgreen) [19:18:17] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2219: Migration of db2219.codfw.wmnet completed [19:18:25] !log jgreen@dns1004 START - running authdns-update [19:20:22] !log jgreen@dns1004 END - running authdns-update [19:45:42] (03PS10) 10Bking: opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [19:47:08] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1242: Migration of db1242.eqiad.wmnet completed [20:00:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [20:02:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2219: Migration of db2219.codfw.wmnet completed [20:03:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [20:04:07] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [20:04:19] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [20:15:36] (PS2) Jdlrobson: Restore menu tab underline style [skins/Vector] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305191 (https://phabricator.wikimedia.org/T428519) [20:15:41] (PS1) C. Scott Ananian: Add $wgParserMigrationEnableParsoid as unified/fine-grained config [extensions/ParserMigration] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305500 [20:15:48] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ParserMigration] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305500 (owner: C. Scott Ananian) [20:16:36] Is there really no one in this window? 
[20:16:56] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: I'd like to add a last minute patch to this window [20:17:01] FIRING: [6x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1017:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:18:27] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [20:18:37] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [20:19:06] cscott: if you're able to deploy, go for it :) [20:19:34] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [20:20:01] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [20:20:03] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [20:20:31] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [20:21:32] TheresNoTime: spiderpig, spiderpig, no one slings code like a spiderpig... [20:21:54] :D [20:22:01] FIRING: [10x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:23:31] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [extensions/ParserMigration] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305500 (owner: C. 
Scott Ananian) [20:24:40] (Merged) jenkins-bot: Add $wgParserMigrationEnableParsoid as unified/fine-grained config [extensions/ParserMigration] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305500 (owner: C. Scott Ananian) [20:25:12] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1305500|Add $wgParserMigrationEnableParsoid as unified/fine-grained config]] [20:27:01] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:27:13] !log cscott@deploy1003 cscott: Backport for [[gerrit:1305500|Add $wgParserMigrationEnableParsoid as unified/fine-grained config]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:27:36] !log cr1-eqiad# set chassis fpc 1 pic 1 port 5 speed 100g - T429623 [20:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:16] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [20:28:30] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [20:29:10] !log cscott@deploy1003 cscott: Continuing with deployment [20:33:33] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305500|Add $wgParserMigrationEnableParsoid as unified/fine-grained config]] (duration: 08m 21s) [20:33:36] (PS11) Bking: opensearch: split plugins_mandatory into own key [puppet] - https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: Ryan Kemper) [20:33:50] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1305321 
(https://phabricator.wikimedia.org/T429844) (owner: Ryan Kemper) [20:34:06] !log draining one of eqiad-codfw transports for PIC bounce [20:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:41] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [20:35:27] expected ^ [20:37:16] !log bouncing cr1-eqiad FPC1 PIC1 - T429623 [20:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:51] (PS1) Reedy: InitialiseSettings: Require 2FA for all on arbcom_*wiki and conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1305501 (https://phabricator.wikimedia.org/T428103) [20:44:41] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [20:49:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:50:42] (PS12) Bking: opensearch: split plugins_mandatory into own key [puppet] - https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: Ryan Kemper) [20:51:12] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1305321 
(https://phabricator.wikimedia.org/T429844) (owner: Ryan Kemper) [20:53:42] (PS1) Dreamy Jazz: Drop $wmgEmergencyCaptcha [mediawiki-config] - https://gerrit.wikimedia.org/r/1305503 (https://phabricator.wikimedia.org/T429849) [20:53:48] jouncebot: nowandnext [20:53:48] For the next 0 hour(s) and 6 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T2000) [20:53:48] In 0 hour(s) and 6 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T2100) [20:55:28] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305503 (https://phabricator.wikimedia.org/T429849) (owner: Dreamy Jazz) [20:55:56] (CR) JHathaway: sre.hosts.provision: introduce the wmfroot user (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: Elukey) [20:56:21] (Merged) jenkins-bot: Drop $wmgEmergencyCaptcha [mediawiki-config] - https://gerrit.wikimedia.org/r/1305503 (https://phabricator.wikimedia.org/T429849) (owner: Dreamy Jazz) [20:56:47] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1305503|Drop $wmgEmergencyCaptcha (T429849)]] [20:56:52] T429849: hCaptcha: Emergency CAPTCHA uses FancyCaptcha - https://phabricator.wikimedia.org/T429849 [20:57:41] (PS13) Bking: opensearch: split plugins_mandatory into own key [puppet] - https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: Ryan Kemper) [20:57:49] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844) (owner: Ryan Kemper) [20:58:50] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1305503|Drop $wmgEmergencyCaptcha (T429849)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:59:08] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260624T2100) [21:03:31] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305503|Drop $wmgEmergencyCaptcha (T429849)]] (duration: 06m 43s) [21:03:37] T429849: hCaptcha: Emergency CAPTCHA uses FancyCaptcha - https://phabricator.wikimedia.org/T429849 [21:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:14:02] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 761558048 and 61 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:23:02] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 24576 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:25:40] (PS14) Ryan Kemper: opensearch: split plugins_mandatory into own key [puppet] - https://gerrit.wikimedia.org/r/1305321 (https://phabricator.wikimedia.org/T429844)