[00:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:23:42] (CR) Dzahn: [V:+1 C:+2] "no worries - this file exists on the server but is currently NOT included in apache config (can be verified with 'apache2ctl -t -D DUMP_IN" [puppet] - https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[00:33:00] (PS1) Dzahn: ci: load mod_ssl in httpd to be able to proxy https [puppet] - https://gerrit.wikimedia.org/r/1305531 (https://phabricator.wikimedia.org/T418521)
[01:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[01:49:19] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:54:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:55:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:55:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:59:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:00:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:00:25] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:00:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:00:49] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 00m 24s)
[02:04:41] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:05:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:29:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:29:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:29:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:44:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:44:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:44:55] !log pt1979@cumin1003 START - Cookbook sre.network.cf
[02:44:56] !log pt1979@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[02:45:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:49:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:07:10] ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/2026-Q3-Q4): Ensure cloudvirt capacity is more evenly spread out among racks - https://phabricator.wikimedia.org/T424658#12052421 (Jclark-ctr)
[03:56:32] (PS1) RLazarus: admin_ng: Fix comment [deployment-charts] - https://gerrit.wikimedia.org/r/1305542
[04:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:52] (CR) RLazarus: "Added in If15f9cc5, looks like just a mispaste." [deployment-charts] - https://gerrit.wikimedia.org/r/1305542 (owner: RLazarus)
[04:20:12] (PS1) RLazarus: admin_ng: Remove obsolete coredns 1.8.7-2 tag, unset everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/1305544
[04:20:12] (PS1) RLazarus: coredns: Parameterize `name` and `k8s_app` [deployment-charts] - https://gerrit.wikimedia.org/r/1305545 (https://phabricator.wikimedia.org/T427864)
[04:20:13] (PS1) RLazarus: coredns: Add an internal_only value [deployment-charts] - https://gerrit.wikimedia.org/r/1305546 (https://phabricator.wikimedia.org/T427864)
[04:20:14] (PS1) RLazarus: admin_ng: Install coredns-internalonly in wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1305547 (https://phabricator.wikimedia.org/T427864)
[04:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[04:34:17] (CR) RLazarus: coredns: Parameterize `name` and `k8s_app` (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1305545 (https://phabricator.wikimedia.org/T427864) (owner: RLazarus)
[05:12:02] (PS1) Marostegui: db1290: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1305549 (https://phabricator.wikimedia.org/T429929)
[05:13:03] (CR) Marostegui: [C:+2] db1290: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1305549 (https://phabricator.wikimedia.org/T429929) (owner: Marostegui)
[05:13:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2233].codfw.wmnet,db[1217,1228,1290].eqiad.wmnet with reason: Primary switchover m2 T429929
[05:13:45] T429929: Switchover m2 master (db1228 -> db1290) - https://phabricator.wikimedia.org/T429929
[05:15:27] (PS1) Marostegui: mariadb: Promote db1290 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1305550 (https://phabricator.wikimedia.org/T429929)
[05:16:25] (CR) Marostegui: [C:+2] mariadb: Promote db1290 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1305550 (https://phabricator.wikimedia.org/T429929) (owner: Marostegui)
[05:17:41] !log Failover m2 from db1228 to db1290 - T429929
[05:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:19:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:35:12] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[05:42:51] (PS1) Marostegui: db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1305562 (https://phabricator.wikimedia.org/T430106)
[05:46:18] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab
[05:50:28] (CR) Marostegui: [C:+2] db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1305562 (https://phabricator.wikimedia.org/T430106) (owner: Marostegui)
[05:58:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1228.eqiad.wmnet with reason: Reimage to Trixie
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T0600)
[06:00:05] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T0600).
[06:01:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1228.eqiad.wmnet with OS trixie
[06:01:50] arnaudb@cumin1003 arnaudb: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[06:04:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:07:58] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[06:08:58] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30036 bytes in 0.500 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[06:11:32] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab
[06:12:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:14:23] (PS1) Marostegui: db1290: Add master role [puppet] - https://gerrit.wikimedia.org/r/1305563
[06:15:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage
[06:16:28] (CR) Marostegui: [C:+2] db1290: Add master role [puppet] - https://gerrit.wikimedia.org/r/1305563 (owner: Marostegui)
[06:17:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:19:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage
[06:24:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:26:34] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12052640 (fgiunchedi) Good find @ayounsi ! Also with respect to racking these hosts, my preference would be to have one host per rack once we can do 25G...
[06:28:06] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305443 (https://phabricator.wikimedia.org/T430059) (owner: Muehlenhoff)
[06:28:47] (CR) Muehlenhoff: [C:+2] Add Jesse to Bitu approvers [puppet] - https://gerrit.wikimedia.org/r/1305443 (https://phabricator.wikimedia.org/T430059) (owner: Muehlenhoff)
[06:29:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:32:28] (PS1) Ayounsi: depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300)
[06:33:58] ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12052650 (fgiunchedi) >>! In T401441#11986629, @VRiley-WMF wrote: > @fgiunchedi for these servers cloudcephosd1048, cloudcephosd1049, cloudcephosd1050, cloudcephosd1051 would we be able to sc...
[06:35:14] (CR) CI reject: [V:-1] depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[06:35:36] (PS1) Muehlenhoff: Add Ahmon Dancy to releng-related approvals [puppet] - https://gerrit.wikimedia.org/r/1305566
[06:39:40] (PS1) Muehlenhoff: Bitu: Add Ahmon Dancy as second approver for Spiderpig access [puppet] - https://gerrit.wikimedia.org/r/1305568
[06:40:35] (PS1) Marostegui: installserver: Do not format clouddb102[6-7] [puppet] - https://gerrit.wikimedia.org/r/1305569 (https://phabricator.wikimedia.org/T409557)
[06:40:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1228.eqiad.wmnet with OS trixie
[06:41:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:45:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:46:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:49:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:55:46] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305566 (owner: Muehlenhoff)
[06:57:45] (PS1) Arnaudb: gerrit: bump 4xx ratio to alert on [alerts] - https://gerrit.wikimedia.org/r/1305574
[06:57:53] (CR) Arnaudb: [C:+2] gerrit: bump 4xx ratio to alert on [alerts] - https://gerrit.wikimedia.org/r/1305574 (owner: Arnaudb)
[07:00:05] Amir1, urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:30] (Merged) jenkins-bot: gerrit: bump 4xx ratio to alert on [alerts] - https://gerrit.wikimedia.org/r/1305574 (owner: Arnaudb)
[07:02:52] (CR) Jelto: [C:+1] "lgtm, thank you" [alerts] - https://gerrit.wikimedia.org/r/1305574 (owner: Arnaudb)
[07:04:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1026.eqiad.wmnet with reason: Catching up
[07:05:01] (CR) Marostegui: [C:+2] installserver: Do not format clouddb102[6-7] [puppet] - https://gerrit.wikimedia.org/r/1305569 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[07:07:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: Cloning
[07:10:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2234.codfw.wmnet,db1250.eqiad.wmnet with reason: Upgrading
[07:11:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS trixie
[07:13:12] PROBLEM - MariaDB Replica IO: m3 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2234.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2234.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[07:17:36] ^ known
[07:18:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2160.codfw.wmnet with reason: Upgrading
[07:19:51] (PS1) Arnaudb: backups: edit gerrit fileset to exclude logs [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583)
[07:20:59] !log T423993: dropping ttmserver indices from the cirrussearch opensearch clusters
[07:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:03] T423993: Upgrade old indices in the CirrusSearch opensearch clusters - https://phabricator.wikimedia.org/T423993
[07:23:53] (PS2) Arnaudb: backups: edit gerrit fileset to exclude logs [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583)
[07:24:59] !log installing nginx security updates
[07:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:42] (CR) Arnaudb: [C:+1] "looks good to me, ccing in @jwodstrcil@wikimedia.org @dzahn@wikimedia.org and @aokoth@wikimedia.org for information" [puppet] - https://gerrit.wikimedia.org/r/1305460 (https://phabricator.wikimedia.org/T430024) (owner: Brouberol)
[07:29:01] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@86ab691] (releasing): T430110 Test on Jenkins secondary
[07:29:32] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@86ab691] (releasing): T430110 Test on Jenkins secondary (duration: 00m 50s)
[07:34:48] SRE, Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12052815 (MoritzMuehlenhoff)
[07:35:16] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2234.codfw.wmnet with OS trixie
[07:35:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS trixie
[07:41:17] ops-codfw, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116 (Marostegui) NEW
[07:41:45] ops-codfw, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12052845 (Marostegui) p:Triage→Medium
[07:41:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2160.codfw.wmnet with reason: Upgrading
[07:43:05] (CR) Muehlenhoff: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1251406 (owner: Jelto)
[07:47:44] !log filippo@cumin1003 START - Cookbook sre.dns.netbox
[07:51:19] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on releases2003.codfw.wmnet with reason: T410849
[07:51:24] T410849: Update to Phorge/Arcanist upstream 2026-06-01 - https://phabricator.wikimedia.org/T410849
[07:52:08] !log filippo@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Allocate IPs for cloudvirt1077 - filippo@cumin1003"
[07:52:12] !log filippo@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Allocate IPs for cloudvirt1077 - filippo@cumin1003"
[07:52:12] !log filippo@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:52] !log marostegui@cumin1003 conftool action : set/weight=10; selector: name=clouddb1026.eqiad.wmnet
[08:03:12] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1026.eqiad.wmnet,service=s1
[08:03:45] !log Pool clouddb1026:s1 with a bit of weight T409557
[08:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:50] T409557: Productionize new clouddb* hosts (clouddb1022-1033) - https://phabricator.wikimedia.org/T409557
[08:07:08] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 retry Jenkins secondary
[08:07:37] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 retry Jenkins secondary (duration: 00m 53s)
[08:09:13] (PS2) Ayounsi: depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300)
[08:09:51] (CR) Brouberol: [V:+1 C:+2] global_config: register phabricator in the external-services [puppet] - https://gerrit.wikimedia.org/r/1305449 (https://phabricator.wikimedia.org/T430024) (owner: Brouberol)
[08:09:54] (CR) Brouberol: [C:+2] phabricator: enable egress from the dse kubepods networks [puppet] - https://gerrit.wikimedia.org/r/1305460 (https://phabricator.wikimedia.org/T430024) (owner: Brouberol)
[08:10:12] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 deploy to Jenkins primary
[08:10:55] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 deploy to Jenkins primary (duration: 00m 52s)
[08:19:16] (CR) Ayounsi: [C:+2] depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[08:19:26] (CR) Ayounsi: [C:+2] "Self merge as it's a minor fix" [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[08:22:01] (Merged) jenkins-bot: depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[08:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:27:21] (CR) Jcrespo: [C:+1] "Seems sensible, wanna me to merge it and check size change?" [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[08:27:28] (CR) Elukey: [C:+2] sre.hosts.provision: introduce the wmfroot user (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[08:27:51] jouncebot: nowandnext
[08:27:51] No deployments scheduled for the next 1 hour(s) and 32 minute(s)
[08:27:51] In 1 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1000)
[08:29:56] o/ I have some private code I'd like to deploy. Would now or some time today be acceptable? I have deployment rights and can self-service. Dreamy_Jazz has kindly agreed to help me if necessary.
[08:32:08] (by which I mean if no one has any objections, I'll start as we're between windows)
[08:33:08] (PS12) Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112)
[08:33:08] (PS1) Btullis: presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112)
[08:33:22] (PS2) Btullis: presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112)
[08:33:26] (PS1) Slyngshede: P:trafficserver::backend map thumb to swift backend [puppet] - https://gerrit.wikimedia.org/r/1305597 (https://phabricator.wikimedia.org/T427465)
[08:33:30] (PS13) Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112)
[08:33:44] (CR) CI reject: [V:-1] presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:35:26] (PS1) Filippo Giunchedi: site: put cloudvirt1077 in service [puppet] - https://gerrit.wikimedia.org/r/1305598 (https://phabricator.wikimedia.org/T429563)
[08:43:36] (PS1) Jforrester: On AW article deletion, clear all AWArticleStore from sections and metadata [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305599 (https://phabricator.wikimedia.org/T429873)
[08:43:38] (PS1) Jforrester: AWStorage: Use global stash keys [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305600 (https://phabricator.wikimedia.org/T430060)
[08:44:45] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet with reason: Reimaging db1221
[08:45:01] (PS1) Elukey: sre.hosts.reimage: user ADMIN or root for ipmi/redfish [cookbooks] - https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180)
[08:45:27] !log marostegui@cumin1003 conftool action : set/weight=30; selector: name=clouddb1026.eqiad.wmnet
[08:46:34] Tran, Dreamy_Jazz: Are you finished? I have a back-port to deploy.
[08:46:46] No it's in progress, testing right now
[08:47:18] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2006.codfw.wmnet with OS trixie
[08:47:43] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:47:43] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893
[08:47:49] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893
[08:47:53] Ack.
[08:48:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1221: Upgrading db1221.eqiad.wmnet
[08:48:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1221: Upgrading db1221.eqiad.wmnet
[08:48:54] (CR) Atsuko: [C:+1] presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:49:03] (CR) Atsuko: [C:+1] presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:50:56] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305598 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[08:50:58] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1221.eqiad.wmnet with OS trixie
[08:51:13] (CR) Arnaudb: [C:+2] "sounds good to me!" [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[08:52:01] (CR) Filippo Giunchedi: [C:+2] site: put cloudvirt1077 in service [puppet] - https://gerrit.wikimedia.org/r/1305598 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[08:52:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:52:23] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893
[08:52:43] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2236: Upgrading db2236.codfw.wmnet
[08:52:45] SRE, hCaptcha, Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12053070 (MoritzMuehlenhoff) >>! In T430045#12051694, @Scott_French wrote: > FWIW, it does not look like https://gerrit.wikimedia.org/r/c/operations/de...
[08:53:05] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2236: Upgrading db2236.codfw.wmnet
[08:54:41] (CR) Btullis: [C:+2] presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:54:52] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2236.codfw.wmnet with OS trixie
[08:54:57] James_F, done. All you.
[08:55:06] Thanks!
[08:55:07] SRE, Citoid: citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430053#12053086 (MoritzMuehlenhoff) This is the same root cause as T430045, merging them.
[08:55:29] 06SRE, 10hCaptcha, 06Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12053091 (10MoritzMuehlenhoff) [08:55:30] 06SRE, 10Citoid: citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430053#12053089 (10MoritzMuehlenhoff) →14Duplicate dup:03T430045 [08:55:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305599 (https://phabricator.wikimedia.org/T429873) (owner: 10Jforrester) [08:55:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305600 (https://phabricator.wikimedia.org/T430060) (owner: 10Jforrester) [08:55:48] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2234.codfw.wmnet with OS trixie [08:55:49] 06SRE, 10hCaptcha, 06Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12053093 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:56:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:57:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:57:33] (03Merged) 10jenkins-bot: On AW article deletion, clear all AWArticleStore from sections and metadata [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305599 (https://phabricator.wikimedia.org/T429873) (owner: 10Jforrester) [08:57:35] (03Merged) 10jenkins-bot: AWStorage: Use global stash keys [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305600 (https://phabricator.wikimedia.org/T430060) (owner: 10Jforrester) [08:57:48] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:58:12] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1305599|On AW article deletion, clear all AWArticleStore from sections and metadata (T429873)]], [[gerrit:1305600|AWStorage: Use global stash keys (T430060)]] [08:58:18] T429873: Implement better deletion strategy for Abstract Content - https://phabricator.wikimedia.org/T429873 [08:58:18] T430060: AWArticleStore MainStash backend cross-wiki behavior not working as expected - https://phabricator.wikimedia.org/T430060 [08:58:44] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [09:00:21] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1305599|On AW article deletion, clear all AWArticleStore from sections and metadata (T429873)]], [[gerrit:1305600|AWStorage: Use global stash keys (T430060)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:00:46] !log jforrester@deploy1003 jforrester: Continuing with deployment [09:03:30] (03CR) 10Elukey: "Tested it with kafka-logging2006, a new host that doesn't have root deployed. 
All good :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [09:03:31] (03CR) 10Jelto: [C:03+1] backups: edit gerrit fileset to exclude logs [puppet] - 10https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583) (owner: 10Arnaudb) [09:03:52] (03PS11) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [09:04:25] (03CR) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:05:27] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2006.codfw.wmnet with reason: host reimage [09:05:41] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305599|On AW article deletion, clear all AWArticleStore from sections and metadata (T429873)]], [[gerrit:1305600|AWStorage: Use global stash keys (T430060)]] (duration: 07m 29s) [09:05:47] T429873: Implement better deletion strategy for Abstract Content - https://phabricator.wikimedia.org/T429873 [09:05:48] T430060: AWArticleStore MainStash backend cross-wiki behavior not working as expected - https://phabricator.wikimedia.org/T430060 [09:06:23] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [09:06:58] !log marostegui@cumin1003 conftool action : set/weight=100; selector: name=clouddb1026.eqiad.wmnet [09:07:14] All done at my end now. 
[09:08:01] (03PS1) 10Brouberol: service: register the phabricator service [puppet] - 10https://gerrit.wikimedia.org/r/1305606 (https://phabricator.wikimedia.org/T430024) [09:08:04] (03PS1) 10Brouberol: service_proxy: register phabricator services [puppet] - 10https://gerrit.wikimedia.org/r/1305607 (https://phabricator.wikimedia.org/T430024) [09:08:18] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:08:55] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2006.codfw.wmnet with reason: host reimage [09:10:33] (03CR) 10Muehlenhoff: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:11:11] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2236.codfw.wmnet with reason: host reimage [09:11:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [09:12:55] (03CR) 10CWilliams: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [09:12:57] (03CR) 10AikoChou: [C:03+1] ml-services: Deploy the latest version of article-country model on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [09:12:57] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2007.codfw.wmnet with OS trixie [09:13:38] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2008.codfw.wmnet with OS trixie [09:15:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2236.codfw.wmnet with reason: host reimage [09:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:21:52] (03PS12) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [09:24:10] (03CR) 10CWilliams: mysql: update replication source (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [09:26:05] (03CR) 10CWilliams: mysql: update replication source (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [09:26:43] (03PS1) 10Kosta Harlan: hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305609 (https://phabricator.wikimedia.org/T429755) [09:26:45] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:26:54] jouncebot: nowandnext [09:26:54] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [09:26:54] In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1000) [09:26:55] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:27:05] deploying a wmf.8 patch [09:27:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305609 (https://phabricator.wikimedia.org/T429755) (owner: 10Kosta Harlan) [09:29:50] elukey@cumin1003 reimage (PID 2747312) is awaiting input [09:31:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1221.eqiad.wmnet with OS trixie [09:31:01] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2007.codfw.wmnet with reason: host reimage [09:31:45] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2008.codfw.wmnet with reason: host reimage [09:32:28] (03CR) 10CWilliams: mysql: update replication source (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [09:33:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2236.codfw.wmnet with OS trixie [09:34:45] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2007.codfw.wmnet with reason: host reimage [09:35:19] (03Merged) 10jenkins-bot: hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305609 (https://phabricator.wikimedia.org/T429755) (owner: 10Kosta Harlan) [09:35:50] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1305609|hCaptcha: Skip blocked-IP score collection for crawlers 
in VE and MF (T429755)]] [09:35:54] T429755: hCaptcha: Exclude self-identified crawlers from IP blocked edit notice risk score collection - https://phabricator.wikimedia.org/T429755 [09:37:54] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1305609|hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF (T429755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:38:33] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2008.codfw.wmnet with reason: host reimage [09:39:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:39:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2006.codfw.wmnet with OS trixie [09:40:19] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:43:25] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1221: Migration of db1221.eqiad.wmnet completed [09:44:36] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305609|hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF (T429755)]] (duration: 08m 46s) [09:44:41] T429755: hCaptcha: Exclude self-identified crawlers from IP blocked edit notice risk score collection - https://phabricator.wikimedia.org/T429755 [09:45:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1305617 (https://phabricator.wikimedia.org/T430127) [09:45:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2236: Migration of db2236.codfw.wmnet completed [09:52:20] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 6648 [09:52:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with 
action 'email' for AS: 6648 [09:53:15] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:56:19] elukey@cumin1003 reimage (PID 2751279) is awaiting input [09:58:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:58:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2007.codfw.wmnet with OS trixie [09:58:09] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:58:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:58:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2008.codfw.wmnet with OS trixie [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1000) [10:00:53] (03CR) 10Blake: [C:03+2] kubernetes: Add a k8s deployment for pretrain. 
[puppet] - 10https://gerrit.wikimedia.org/r/1305358 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [10:02:21] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1305397 (https://phabricator.wikimedia.org/T420438) (owner: 10Klausman) [10:03:54] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: ml-staging-master@codfw [10:04:08] (03CR) 10Klausman: [C:03+2] hiera: Switch ml-staging k8s to Maglev LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1305397 (https://phabricator.wikimedia.org/T420438) (owner: 10Klausman) [10:07:25] !log klausman@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for alias: ml-staging-master@codfw [10:09:18] (03PS1) 10Elukey: pontoon: add config for the kafka-upgrade stack used for testing [puppet] - 10https://gerrit.wikimedia.org/r/1305620 [10:09:26] i intend to use the infra window to do a non-build stop-before-sync deploy, in order to populate the release file for a new deployment (mw-pretrain) (T427668), and will proceed in 5m if there are no objections [10:09:26] T427668: Turn up the Pretrain MVP environment - https://phabricator.wikimedia.org/T427668 [10:09:58] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: ml-staging-master@codfw [10:10:32] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12053340 (10elukey) All hosts reimaged, we should be good! 
[10:13:32] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:14:06] (03CR) 10Nikerabbit: [C:03+1] Drop fund, phortune, support [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300846 (https://phabricator.wikimedia.org/T418655) (owner: 10Pppery) [10:14:20] !log klausman@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:14:20] !log klausman@cumin2002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: ml-staging-master@codfw [10:15:19] !log blake@deploy1003 Started scap sync-world: Non-deployment scap run to populate new release values [10:15:45] !log blake@deploy1003 Stopping before sync operations [10:15:56] (03CR) 10Federico Ceratto: mysql: update replication source (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [10:16:43] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1026.eqiad.wmnet,service=s1 [10:17:31] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053382 (10CWilliams-WMF) [10:18:13] (03PS2) 10Elukey: pontoon: add config for the kafka-upgrade stack used for testing [puppet] - 10https://gerrit.wikimedia.org/r/1305620 [10:21:42] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053392 (10CWilliams-WMF) @Volans I have linked the ticket that I created for a very similar scenario, now mar... 
[10:21:54] (03CR) 10Klausman: [C:03+2] role::ml_k8s::staging::worker: enable IPIP encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1294225 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:22:09] (03PS1) 10Bartosz Wójtowicz: rest-gateway: Add LiftWingLLM rate limit policy for LLM endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) [10:22:51] (03PS2) 10Elukey: Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) [10:23:09] (03CR) 10Klausman: [C:03+2] Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:23:15] (03CR) 10Klausman: [V:03+2 C:03+2] Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:25:06] (03CR) 10Klausman: [C:03+2] Set Maglev's scheduling for inference-staging and ingress [puppet] - 10https://gerrit.wikimedia.org/r/1294226 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:28:00] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#12053445 (10elukey) @RKemper I added you to the `kafka-infrastructure` cloud project, you should see it in Horizon! At this point,... 
[10:28:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1221: Migration of db1221.eqiad.wmnet completed [10:28:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:31:04] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053465 (10MoritzMuehlenhoff) This should soon no longer be an issue once https://phabricator.wikimedia.org/T4... [10:31:27] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2236: Migration of db2236.codfw.wmnet completed [10:31:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:32:12] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from to build2002 - https://phabricator.wikimedia.org/T417389#12053474 (10MoritzMuehlenhoff) I'll first add a new build host on trixie and then fail over to that instead. [10:34:55] (03PS1) 10Klausman: role/ml_k8s/staging/worker: add IPIP role [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) [10:35:45] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from to build2003 - https://phabricator.wikimedia.org/T417389#12053489 (10MoritzMuehlenhoff) [10:36:02] (03CR) 10Ozge: [C:03+1] ml-services: Deploy the latest version of article-country model on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [10:37:14] !log jmm@cumin2003 START - Cookbook sre.ganeti.makevm for new host build2003.codfw.wmnet [10:37:16] !log jmm@cumin2003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host build2003.codfw.wmnet [10:37:21] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8781/co" [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) (owner: 10Klausman) [10:38:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host build2003.codfw.wmnet [10:38:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host build2003.codfw.wmnet [10:41:06] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from to build2004 - https://phabricator.wikimedia.org/T417389#12053500 (10MoritzMuehlenhoff) [10:42:48] (03PS1) 10Muehlenhoff: Add build2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1305627 (https://phabricator.wikimedia.org/T417389) [10:43:01] (03PS2) 10Muehlenhoff: Add build2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1305627 (https://phabricator.wikimedia.org/T417389) [10:43:52] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304147 (owner: 10PipelineBot) [10:45:32] (03CR) 10Elukey: [C:03+1] "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1298541 (owner: 10Volans) [10:46:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [10:46:33] (03CR) 10Daniel Kinzler: "That looks about right at a glance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [10:49:49] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy the latest version of article-country model on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [10:50:29] (03CR) 10CI reject: [V:04-1] config: type config_file as PathLike[str] [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1298541 (owner: 10Volans) [10:52:01] (03Merged) 10jenkins-bot: ml-services: Deploy the latest version of article-country model on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [10:55:46] (03CR) 10Muehlenhoff: [C:03+2] Add build2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1305627 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [10:57:27] (03CR) 10Ilias Sarantopoulos: "lgtm! I'll defer to claime for the technical review but I can comment that I agree on the limits and the grouping as it is in line with wh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:00:12] (03CR) 10Clément Goubert: [C:03+1] "Looks right, just a quick question, will qwen314b be moved under that path?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:00:39] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053564 (10CWilliams-WMF) @MoritzMuehlenhoff thanks for that! 
[11:02:19] (03PS1) 10JavierMonton: k8s namespace: webrequest-page-trending [puppet] - 10https://gerrit.wikimedia.org/r/1305630 (https://phabricator.wikimedia.org/T430136) [11:05:00] !log jforrester@deploy1003: mwscript sql.php --wiki=wikifunctionswiki --cluster extension1 extensions/WikiLambda/sql/mysql/table-wikifunctions_usage.sql # T428667 [11:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:04] T428667: Create a new x1 tables for cross-wiki tracking of Wikifunctions usage, similar to GlobalUsage - https://phabricator.wikimedia.org/T428667 [11:05:07] !log jforrester@deploy1003: mwscript sql.php --wiki=wikifunctionswiki --cluster extension1 extensions/WikiLambda/sql/mysql/table-wikifunctions_usage_wikis.sql # T428667 [11:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:02] (03PS1) 10JavierMonton: namespaces: webrequest-page-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) [11:09:13] !log jmm@cumin2003 START - Cookbook sre.ganeti.makevm for new host build2004.codfw.wmnet [11:09:16] !log jmm@cumin2003 START - Cookbook sre.dns.netbox [11:10:27] (03CR) 10Jelto: [V:03+1 C:03+2] profile::base::reboot_unattended: add class to mark hosts for unattended reboots (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [11:11:42] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1208) - https://phabricator.wikimedia.org/T430138 (10LSobanski) 03NEW [11:12:12] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T430139 (10LSobanski) 03NEW [11:14:28] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[11:14:39] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [11:15:09] (03PS2) 10JavierMonton: namespaces: webrequest-page-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) [11:15:23] jmm@cumin2003 makevm (PID 317727) is awaiting input [11:16:45] (03PS3) 10JavierMonton: namespaces: pageview-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) [11:17:02] (03PS2) 10JavierMonton: k8s namespace: pageview-trending [puppet] - 10https://gerrit.wikimedia.org/r/1305630 (https://phabricator.wikimedia.org/T430136) [11:24:02] (03PS1) 10Fabfur: cache::haproxy: add correlation id feature [puppet] - 10https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) [11:25:34] !log jmm@cumin2003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM build2004.codfw.wmnet - jmm@cumin2003" [11:25:38] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM build2004.codfw.wmnet - jmm@cumin2003" [11:25:38] !log jmm@cumin2003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:25:39] !log jmm@cumin2003 START - Cookbook sre.dns.wipe-cache build2004.codfw.wmnet on all recursors [11:25:42] !log jmm@cumin2003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) build2004.codfw.wmnet on all recursors [11:26:13] !log jmm@cumin2003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM build2004.codfw.wmnet - jmm@cumin2003" [11:26:18] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
build2004.codfw.wmnet - jmm@cumin2003" [11:27:36] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) (owner: 10Fabfur) [11:28:28] !log jmm@cumin2003 START - Cookbook sre.hosts.reimage for host build2004.codfw.wmnet with OS trixie [11:40:08] (03CR) 10Bartosz Wójtowicz: "Yes, the plan is to move qwen3-14b and other LLMs we would like to expose under this path, it'd be done in follow-up patches." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:44:01] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from build2001 to build2004 - https://phabricator.wikimedia.org/T417389#12053767 (10LSobanski) [11:45:02] (03CR) 10Clément Goubert: [C:03+1] "Then be mindful that the pipe caching will only match `openai/v1` https://gerrit.wikimedia.org/r/c/operations/puppet/+/1293746" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:47:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [11:48:38] !log jmm@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [11:50:33] !log installing harfbuzz security updates [11:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:35] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [11:57:09] (03PS1) 10Dpogorzelski: ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) [11:57:21] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
lsw1-a7-codfw,lsw1-a7-codfw IPv6,lsw1-a7-codfw.mgmt with reason: Switch maintenance [11:57:30] (03CR) 10CI reject: [V:04-1] ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) (owner: 10Dpogorzelski) [11:58:33] (03PS2) 10Dpogorzelski: ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) [11:58:47] (03PS1) 10Mareike Heuer: Remove wgCiteRemoveSyntheticRefsUnsafe feature flag from production and beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305639 (https://phabricator.wikimedia.org/T428232) [11:59:39] (03PS1) 10Muehlenhoff: Add library hint for harfbuff [puppet] - 10https://gerrit.wikimedia.org/r/1305644 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1200) [12:02:21] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for harfbuff [puppet] - 10https://gerrit.wikimedia.org/r/1305644 (owner: 10Muehlenhoff) [12:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:21] (03CR) 10Jforrester: "Note that these tables are now live, so this would be good to land soon to avoid alerts." 
[puppet] - https://gerrit.wikimedia.org/r/1305102 (https://phabricator.wikimedia.org/T428667) (owner: Jforrester)
[12:04:35] !log ayounsi@cumin1003 START - Cookbook sre.network.depool-rack with action 'depool' for codfw rack A7
[12:05:02] (PS2) Bartosz Wójtowicz: rest-gateway: Add LiftWingLLM rate limit policy for LLM endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749)
[12:06:25] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Rack A7 depool
[12:07:04] (CR) Bartosz Wójtowicz: "Noted, our endpoints are all under `openai/v1` so to be consistent I've narrowed the route's regex match to only `openai/v1` to match the " [deployment-charts] - https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: Bartosz Wójtowicz)
[12:08:01] !log aokoth@cumin1003 START - Cookbook sre.hosts.decommission for hosts phab2002.codfw.wmnet
[12:08:24] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2224: rack depool
[12:08:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2224: rack depool
[12:09:02] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2225: rack depool
[12:09:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2225: rack depool
[12:09:36] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool es2045: rack depool
[12:09:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2045: rack depool
[12:12:39] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[12:12:39] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893
[12:12:46] (PS1) Atsuko: data-platform/k8s: monitor for unreleased k8s changes [alerts] - https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078)
[12:12:51] !log ayounsi@cumin1003 START -
Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet
[12:12:53] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet
[12:12:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1243: Upgrading db1243.eqiad.wmnet
[12:13:14] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2258-2259].codfw.wmnet
[12:13:17] !log aokoth@cumin1003 START - Cookbook sre.dns.netbox
[12:13:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1243: Upgrading db1243.eqiad.wmnet
[12:13:52] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet
[12:13:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet
[12:14:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2258-2259].codfw.wmnet
[12:14:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.depool-rack (exit_code=0) with action 'depool' for codfw rack A7
[12:14:49] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on logstash2023.codfw.wmnet with reason: A7 maintenace
[12:15:47] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd2003.codfw.wmnet with reason: A7 maintenace
[12:15:54] ops-codfw, SRE, DC-Ops, observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12053887 (Jhancock.wm)
[12:15:59] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS trixie
[12:16:10] !log cwilliams@cumin1003
START - Cookbook sre.mysql.major-upgrade
[12:16:10] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893
[12:16:12] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dse-k8s-etcd2001.codfw.wmnet with reason: A7 maintenace
[12:16:20] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2237: Upgrading db2237.codfw.wmnet
[12:16:36] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagemaster2005.codfw.wmnet with reason: A7 maintenace
[12:16:42] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2237: Upgrading db2237.codfw.wmnet
[12:17:23] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1305648 (owner: L10n-bot)
[12:18:16] !log aokoth@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1003"
[12:18:25] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2237.codfw.wmnet with OS trixie
[12:19:15] !log aokoth@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1003"
[12:19:15] !log aokoth@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:19:16] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts phab2002.codfw.wmnet
[12:20:33] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:24:11] (CR) Btullis: [C:+1] "Looks good to me. Thanks."
[alerts] - https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko)
[12:26:12] (PS1) AOkoth: site: remove phab2002 [puppet] - https://gerrit.wikimedia.org/r/1305661 (https://phabricator.wikimedia.org/T423727)
[12:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[12:28:56] (CR) Hashar: [C:+1] ci: load mod_ssl in httpd to be able to proxy https [puppet] - https://gerrit.wikimedia.org/r/1305531 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[12:29:45] ops-codfw, SRE, DC-Ops, observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12053940 (Jhancock.wm) Open→Resolved
[12:31:39] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1019 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[12:31:42] (CR) Atsuko: [C:+2] data-platform/k8s: monitor for unreleased k8s changes [alerts] - https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko)
[12:32:33] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[12:32:46] (PS5) Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153)
[12:33:07] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1243.eqiad.wmnet with reason: host reimage
[12:33:17] (CR) Elukey: [C:+2] sre.hosts.reimage: user ADMIN or
root for ipmi/redfish [cookbooks] - https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[12:33:48] (Merged) jenkins-bot: data-platform/k8s: monitor for unreleased k8s changes [alerts] - https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko)
[12:35:01] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2237.codfw.wmnet with reason: host reimage
[12:37:35] federico3, _joe_, I'm going to reboot rack A7 switch for maintenance, all servers are depooled, the only unknown is cephosd2001, I can't get a hold on anyone to know if it needs a depool or not, but afaik those services are fault tolerant. Everything is downtimed.
[12:38:16] <_joe_> uhm chephosd I suppose is DPE SRE?
[12:38:28] <_joe_> btullis / brouberol, any idea?
[12:38:29] XioNoX: these are maintained by Data Platform, I usually ping Ben for things
[12:38:45] _joe_: yeah, I pinged everyone multiple times
[12:38:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1243.eqiad.wmnet with reason: host reimage
[12:39:22] <_joe_> gehel: ^^ please advise
[12:39:31] XioNoX: thanks, ack
[12:39:53] (PS1) Elukey: Pin pytest version and fix mypy errors in config.py [software/pywmflib] - https://gerrit.wikimedia.org/r/1305665
[12:40:42] <_joe_> XioNoX: give me a couple minutes, I'm pinging people on slack
[12:40:50] _joe_: thanks for the help!
[12:41:46] we should be good, but let me check
[12:41:58] (CR) Klausman: [C:+1] ml-serve: add GPU partition allocation metrics [puppet] - https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) (owner: Dpogorzelski)
[12:42:51] <3
[12:43:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2237.codfw.wmnet with reason: host reimage
[12:43:18] (CR) Elukey: "This is the starting point to get CI working again, then we'll be able to rebase Riccardo's and Jesse's patches on top. Lemme know!" [software/pywmflib] - https://gerrit.wikimedia.org/r/1305665 (owner: Elukey)
[12:43:26] XioNoX: we're good, you can go ahead!
[12:43:48] <_joe_> :) thanks gehel
[12:44:06] awesome, thanks!
[12:44:31] !log lsw1-a7-codfw> request system reboot - T429817
[12:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:36] T429817: codfw: rack A7 maintenance - https://phabricator.wikimedia.org/T429817
[12:44:40] (CR) Dpogorzelski: [C:+2] ml-serve: add GPU partition allocation metrics [puppet] - https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) (owner: Dpogorzelski)
[12:46:01] (CR) Volans: [C:-1] "LGTM but we need to support bullseye too" [software/pywmflib] - https://gerrit.wikimedia.org/r/1305665 (owner: Elukey)
[12:47:10] (PS2) Elukey: Pin pytest version and fix mypy errors in config.py [software/pywmflib] - https://gerrit.wikimedia.org/r/1305665
[12:47:18] (CR) Elukey: Pin pytest version and fix mypy errors in config.py (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/1305665 (owner: Elukey)
[12:47:48] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:47:48] PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:48:54]
(CR) Volans: [C:+1] "LGTM" [software/pywmflib] - https://gerrit.wikimedia.org/r/1305665 (owner: Elukey)
[12:49:14] _joe_ XioNoX sorry I was OOO for a bit
[12:49:22] reading the backscroll
[12:49:24] (PS1) Filippo Giunchedi: hieradata: add nova id for cloudvirt1077 [puppet] - https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563)
[12:49:25] <_joe_> brouberol: I doubt you can be forgiven
[12:49:41] <_joe_> to the gallows!
[12:49:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/6 (Core: lsw1-a7-codfw:et-0/0/55 {#230403800019}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[12:49:54] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a7-codfw (10.192.252.9) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:49:55] XioNoX: feel free to reboot the rack. That ceph cluster is un-used atm anyway
[12:50:15] I'll see myself to the gallows then
[12:51:20] <_joe_> brouberol: sorry, I don't make the rules, or the business needs.
[12:51:40] shot taken
[12:51:55] <_joe_> :always_has_been:
[12:52:17] (CR) Filippo Giunchedi: "root@cloudvirt1077:~# cat /etc/nova/compute_id" [puppet] - https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[12:52:38] (CR) Elukey: [C:+2] Pin pytest version and fix mypy errors in config.py [software/pywmflib] - https://gerrit.wikimedia.org/r/1305665 (owner: Elukey)
[12:54:41] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:56:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1243.eqiad.wmnet with OS trixie
[12:57:37] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[12:57:44] switch is back up
[12:57:47] (CR) Volans: [C:+2] hieradata: add nova id for cloudvirt1077 [puppet] - https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[12:57:50] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:57:50] RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:58:03] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[12:58:19] (CR) Filippo Giunchedi: [C:+2] hieradata: add nova id for cloudvirt1077 [puppet] - https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563)
(owner: Filippo Giunchedi)
[12:58:31] going to proceed with the repool very soon
[12:58:38] (PS1) Anzx: isvwiki: set timezone, sitename and logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935)
[12:59:09] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935) (owner: Anzx)
[12:59:39] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2237.codfw.wmnet with OS trixie
[12:59:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:59:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/6 (Core: lsw1-a7-codfw:et-0/0/55 {#230403800019}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[12:59:54] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a7-codfw (10.192.252.9) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:00:04] Lucas_WMDE, urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1300).
[13:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:16] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1077.eqiad.wmnet with OS trixie
[13:00:17] o/
[13:00:28] (PS6) Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153)
[13:01:25] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:34] !log installing glib2.0 security updates
[13:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:37] RESOLVED: [6x] CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[13:02:56] I can deploy
[13:03:35] moritzm: you can repool A7 ganeti
[13:03:40] XioNoX: on it
[13:04:12] (CR) Zabe: [C:+2] isvwiki: set timezone, sitename and logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935) (owner: Anzx)
[13:04:47] (CR) Zabe: [C:+2] Use Hadoop for Mostcategories on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1248909 (https://phabricator.wikimedia.org/T413362) (owner: Zabe)
[13:04:59] (PS1) Ottomata: html_content_change - bump image to v1.56.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1305672 (https://phabricator.wikimedia.org/T427598)
[13:05:12] (Merged) jenkins-bot: isvwiki: set timezone,
sitename and logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935) (owner: Anzx)
[13:05:27] !log jmm@cumin2003 START - Cookbook sre.ganeti.addnode for new host ganeti2028.codfw.wmnet to cluster codfw and group A
[13:05:29] (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305620 (owner: Elukey)
[13:05:42] (Merged) jenkins-bot: Use Hadoop for Mostcategories on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1248909 (https://phabricator.wikimedia.org/T413362) (owner: Zabe)
[13:06:15] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1305476|isvwiki: set timezone, sitename and logos (T429935)]], [[gerrit:1248909|Use Hadoop for Mostcategories on commonswiki (T413362)]]
[13:06:17] !log re-added ganeti2028 to codfw/A Ganeti cluster T429817
[13:06:22] T429935: Post-creation work for isvwiki - https://phabricator.wikimedia.org/T429935
[13:06:24] T413362: Move Mostcategories computation to Hadoop - https://phabricator.wikimedia.org/T413362
[13:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:28] T429817: codfw: rack A7 maintenance - https://phabricator.wikimedia.org/T429817
[13:07:11] (CR) Ottomata: [C:+2] html_content_change - bump image to v1.56.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1305672 (https://phabricator.wikimedia.org/T427598) (owner: Ottomata)
[13:07:57] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2258-2259].codfw.wmnet
[13:07:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2258-2259].codfw.wmnet
[13:08:23] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet
[13:08:24] !log zabe@deploy1003 zabe, anzx: Backport for [[gerrit:1305476|isvwiki: set
timezone, sitename and logos (T429935)]], [[gerrit:1248909|Use Hadoop for Mostcategories on commonswiki (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:08:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet
[13:08:36] looking
[13:09:16] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2225: rack depool
[13:09:17] zabe: looks good, ok to sync
[13:09:18] (Merged) jenkins-bot: html_content_change - bump image to v1.56.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1305672 (https://phabricator.wikimedia.org/T427598) (owner: Ottomata)
[13:09:21] Thanks!
[13:09:25] !log zabe@deploy1003 zabe, anzx: Continuing with deployment
[13:09:45] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2224: rack depool
[13:10:25] jmm@cumin2003 addnode (PID 352666) is awaiting input
[13:11:30] zabe: please run namespacedupes.php for isvwiki after sync
[13:11:32] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1243: Migration of db1243.eqiad.wmnet completed
[13:11:37] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-wdqs-test2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:12:05] !log filippo@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1077.eqiad.wmnet with reason: host reimage
[13:13:37] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[13:13:45] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305476|isvwiki: set timezone, sitename and logos (T429935)]], [[gerrit:1248909|Use Hadoop for Mostcategories on commonswiki (T413362)]] (duration: 07m 30s)
[13:13:54] T429935: Post-creation work for isvwiki - https://phabricator.wikimedia.org/T429935
[13:13:54] T413362: Move Mostcategories computation to Hadoop - https://phabricator.wikimedia.org/T413362
[13:14:04] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[13:14:06] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[13:14:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:14:25] (CR) CWilliams: mysql: update replication source (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: Federico Ceratto)
[13:14:34] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[13:15:11] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2237: Migration of db2237.codfw.wmnet completed
[13:16:22] !log zabe@deploy1003:~$ mwscript namespaceDupes.php isvwiki --fix # T429935
[13:16:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:52] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:16:52] zabe: thanks for deploying
[13:16:57] yw
[13:16:57] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:17:21] (CR) Bking: [C:+2] opensearch: split plugins_mandatory into own key [puppet] - https://gerrit.wikimedia.org/r/1305321
(https://phabricator.wikimedia.org/T429844) (owner: Ryan Kemper)
[13:17:35] zabe@deploy1003:~$ echo 'https://en.wikipedia.org/static/images/project-logos/isvwiki.png' | mwscript-k8s --attach purgeList.php -- --wiki enwiki # T429935
[13:18:03] (probably not needed since it is new and not a change)
[13:19:00] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1077.eqiad.wmnet with reason: host reimage
[13:19:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:20:32] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[13:21:01] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2028.codfw.wmnet to cluster codfw and group A
[13:27:56] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-wdqs-test2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:28:34] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[13:28:36] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[13:28:37] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[13:28:41] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[13:28:45] (PS4) JavierMonton: namespaces: pageview-trending [deployment-charts] - https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136)
[13:29:06] (PS7) Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] -
https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153)
[13:29:33] (PS3) JavierMonton: k8s namespace: pageview-trending [puppet] - https://gerrit.wikimedia.org/r/1305630 (https://phabricator.wikimedia.org/T430136)
[13:29:40] (CR) CI reject: [V:-1] mirrormaker: move alert defs on profile::kafka::mirror [puppet] - https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: Tiziano Fogli)
[13:33:58] (PS8) Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153)
[13:37:36] !log installing imagemagick security updates
[13:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:52] (PS1) Muehlenhoff: Add Hiera config for build2004 [puppet] - https://gerrit.wikimedia.org/r/1305677 (https://phabricator.wikimedia.org/T417389)
[13:43:20] (CR) Elukey: [C:+1] Add Hiera config for build2004 [puppet] - https://gerrit.wikimedia.org/r/1305677 (https://phabricator.wikimedia.org/T417389) (owner: Muehlenhoff)
[13:47:28] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[13:47:56] (CR) Muehlenhoff: [C:+2] Add Hiera config for build2004 [puppet] - https://gerrit.wikimedia.org/r/1305677 (https://phabricator.wikimedia.org/T417389) (owner: Muehlenhoff)
[13:48:28] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[13:50:55] (CR) Andrew Bogott: [C:+1] "I have cherry-picked this to the cloud-vps puppetserver; I don't like having local patches there so we need to decide whether to merge or " [puppet] - https://gerrit.wikimedia.org/r/1302978
(https://phabricator.wikimedia.org/T429413) (owner: Ahmon Dancy)
[13:51:12] !log jmm@cumin2003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host build2004.codfw.wmnet with OS trixie
[13:51:12] !log jmm@cumin2003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host build2004.codfw.wmnet
[13:53:27] (PS1) Muehlenhoff: Iniitally install build2004 with insetup [puppet] - https://gerrit.wikimedia.org/r/1305678
[13:53:48] (PS1) AikoChou: ml-services: bump event-emitting isvc image tags in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1305679 (https://phabricator.wikimedia.org/T421237)
[13:54:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2225: rack depool
[13:55:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2224: rack depool
[13:57:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1243: Migration of db1243.eqiad.wmnet completed
[13:57:04] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[14:00:41] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2237: Migration of db2237.codfw.wmnet completed
[14:00:42] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[14:00:44] (PS1) Dpogorzelski: ml-serve: GPU partitions by size and MI210 support [puppet] - https://gerrit.wikimedia.org/r/1305680 (https://phabricator.wikimedia.org/T429597)
[14:01:54] (CR) Dpogorzelski: [C:+2] ml-serve: GPU partitions by size and MI210 support [puppet] - https://gerrit.wikimedia.org/r/1305680 (https://phabricator.wikimedia.org/T429597) (owner: Dpogorzelski)
[14:04:02] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1077.eqiad.wmnet with OS trixie
[14:05:15] !log Ran `delete from cuci_user where ciu_ciwm_id = 4;` for T430156
[14:05:18] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:19] T430156: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'apiportalwiki'Function: Wikimedia\Rdbms\DatabaseMySQL::doSelectDomainQuery: USE `apiportalwiki` - https://phabricator.wikimedia.org/T430156
[14:08:35] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[14:08:35] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893
[14:08:42] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893
[14:08:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1247: Upgrading db1247.eqiad.wmnet
[14:09:26] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1247: Upgrading db1247.eqiad.wmnet
[14:11:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1247.eqiad.wmnet with OS trixie
[14:11:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[14:11:50] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893
[14:12:11] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2245: Upgrading db2245.codfw.wmnet
[14:12:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2245: Upgrading db2245.codfw.wmnet
[14:14:03] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2245.codfw.wmnet with OS trixie
[14:19:57] (CR) Muehlenhoff: [C:+2] Iniitally install build2004 with insetup [puppet] - https://gerrit.wikimedia.org/r/1305678 (owner: Muehlenhoff)
[14:21:02] !log Restarting CI Jenkins on contint1002
[14:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:03] !log jmm@cumin2003 START - Cookbook sre.hosts.reimage for host build2004.codfw.wmnet with OS trixie
[14:28:31] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage
[14:30:05] Deploy window Test Kitchen Experiment Deployment Window
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1430) [14:30:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2245.codfw.wmnet with reason: host reimage [14:34:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage [14:34:38] 06SRE, 10SRE-Access-Requests: Requesting access for lerickson to deploy the RDF streaming updater on wikikube - https://phabricator.wikimedia.org/T429610#12054485 (10thcipriani) >>! In T429610#12039261, @MoritzMuehlenhoff wrote: > @thcipriani This needs your approval for the deployment group. Sorry for delay,... [14:36:22] !log Drop database apiportalwiki on sanitarium and wikireplicas T430102 [14:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:27] T430102: Delete apiportalwiki from wikireplicas - https://phabricator.wikimedia.org/T430102 [14:36:32] (03CR) 10Ahmon Dancy: "Thanks Andrew. It worked! So, I'll submit a new patchset which removes the deployment-dancy* hostname test condition." 
[puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [14:37:28] (03CR) 10Fabfur: [C:03+1] role/ml_k8s/staging/worker: add IPIP role [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) (owner: 10Klausman) [14:38:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2245.codfw.wmnet with reason: host reimage [14:38:18] jouncebot: nowandnext [14:38:18] For the next 0 hour(s) and 21 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1430) [14:38:18] In 0 hour(s) and 21 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1500) [14:39:51] 06SRE, 06Infrastructure-Foundations: Adding Jesse to approvers for Bitu - https://phabricator.wikimedia.org/T430059#12054505 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access has been setup and confirmed to be working fine. [14:40:07] (03CR) 10Hnowlan: [C:03+2] redis: disable nrpe checks, replace with prometheus checks [puppet] - 10https://gerrit.wikimedia.org/r/1305347 (https://phabricator.wikimedia.org/T384924) (owner: 10Tiziano Fogli) [14:40:25] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054512 (10Jhancock.wm) @Marostegui that's correct. i upgraded the bios (and the idrac) firmware and it's fixed the issue. it did boot rather than reimage. So you might have to restart whatever you... [14:42:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS trixie [14:42:30] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054522 (10Marostegui) Thank you @Jhancock.wm - just restarted the reimage! [14:43:09] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. 
- https://phabricator.wikimedia.org/T430116#12054529 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:43:21] (03PS5) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [14:43:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304888 (https://phabricator.wikimedia.org/T429830) (owner: 10Arlolra) [14:47:48] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new: Setup url-downloader-next.w.o to simply tests - https://phabricator.wikimedia.org/T430166 (10MoritzMuehlenhoff) 03NEW [14:48:05] !log jmm@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [14:48:13] 06SRE, 10hCaptcha, 06Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12054558 (10MoritzMuehlenhoff) >>! In T430045#12053070, @MoritzMuehlenhoff wrote: > One other actionable is to add a new CNAME url-downloader-next, which... [14:51:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1247.eqiad.wmnet with OS trixie [14:52:04] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [14:52:09] (03CR) 10Ahmon Dancy: "@abogott@wikimedia.org This is the desired final version. Lemme know what you think." 
[puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [14:53:12] (03PS14) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [14:53:12] (03PS1) 10Btullis: presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) [14:54:01] (03CR) 10CI reject: [V:04-1] presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:54:05] (03CR) 10CI reject: [V:04-1] presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:56:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2245.codfw.wmnet with OS trixie [14:57:21] (03PS2) 10Btullis: presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) [14:57:21] (03PS15) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [14:57:48] (03PS3) 10Btullis: presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) [14:57:49] (03CR) 10CI reject: [V:04-1] presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:58:02] (03CR) 10CI reject: [V:04-1] presto: Enable resource groups and spill on the production cluster [puppet] - 
10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:58:10] (03PS16) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [14:59:32] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-06-23-135458 to 2026-06-25-145651 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305698 (https://phabricator.wikimedia.org/T416144) [14:59:34] (03CR) 10Brouberol: [C:03+1] "LG!" [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [14:59:39] !log ongoing maintenance on cr2-eqdfw [14:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:04] brennen and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1500). 
[15:00:35] !log pt1979@cumin2003 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-eqdfw cr2-eqdfw IPv6 with reason: junos upgrade [15:02:42] !log pt1979@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: junos upgrade [15:06:14] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1247: Migration of db1247.eqiad.wmnet completed [15:08:43] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host build2004.codfw.wmnet with OS trixie [15:10:18] (PS17) Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [15:10:29] (PS18) Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [15:10:35] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: Btullis) [15:10:42] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: Btullis) [15:11:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2245: Migration of db2245.codfw.wmnet completed [15:12:30] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:19:57] ops-codfw, SRE, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054790 (Jhancock.wm) Resolved→Open @Marostegui i rechecked just now. i guess i was wrong. lemme know when you are done and i'll go reseat some cables. [15:20:48] ops-codfw, SRE, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054800 (Marostegui) Yeah it's not booting :(. 
You can go ahead and do anything you need to do. Thanks! [15:22:00] (CR) Jforrester: [C:+2] wikifunctions: Upgrade evaluators from 2026-06-23-135458 to 2026-06-25-145651 [deployment-charts] - https://gerrit.wikimedia.org/r/1305698 (https://phabricator.wikimedia.org/T416144) (owner: Jforrester) [15:22:36] Need to do a private code deploy if there aren't any objections. I don't think anything is happening in this window right now. [15:23:15] Tran: I'm deploying a service but it won't affect MW-land. [15:23:39] JennH: hey i am going to wait another 5 minutes and start the junos upgrade on the router and reboot; it looks like all the transport links are drained now [15:24:17] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2026-06-23-135458 to 2026-06-25-145651 [deployment-charts] - https://gerrit.wikimedia.org/r/1305698 (https://phabricator.wikimedia.org/T416144) (owner: Jforrester) [15:24:38] James_F: Should I still wait until you're done? [15:24:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:24:51] Tran: No, just go for it. [15:25:00] alright, thanks. Starting then. 
[15:25:13] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:26:02] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:26:14] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:26:44] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:26:50] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:27:18] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:28:01] (Done.) [15:29:24] (CR) Aleksandar Mastilovic: [V:+1 C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: Btullis) [15:31:58] (CR) Aleksandar Mastilovic: [V:+1 C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: Btullis) [15:32:30] FIRING: Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:33:39] (PS15) Btullis: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: Trueg) [15:34:28] (PS1) Volans: systemd::path: fix empty Unit= in path unit [puppet] - https://gerrit.wikimedia.org/r/1305701 [15:34:28] (CR) Volans: "It seems that this class is currently unused but I was planning to use it in the next patch in the series, and discovered the typo." 
[puppet] - https://gerrit.wikimedia.org/r/1305701 (owner: Volans) [15:36:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [15:36:18] (PS1) Volans: utils: skip .mypy_cache in run_ci_locally.sh [puppet] - https://gerrit.wikimedia.org/r/1305703 [15:36:18] (CR) Volans: "I've encountered this issue when running CI "locally" inside a VM on my laptop via [1]." [puppet] - https://gerrit.wikimedia.org/r/1305703 (owner: Volans) [15:37:30] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:39:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:40:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:41:05] Done as well [15:45:28] (PS1) C. 
Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) [15:47:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [15:48:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:49:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:16] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:26] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:26] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:26] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:35] (03CR) 10Elukey: "@brouberol@wikimedia.org I am chatting with Tiziano, I think that this is a good occasion to do some cleanup.. 
the kafka mirror profile is" [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [15:49:36] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:50:22] (03CR) 10Elukey: [C:03+2] pontoon: add config for the kafka-upgrade stack used for testing [puppet] - 10https://gerrit.wikimedia.org/r/1305620 (owner: 10Elukey) [15:50:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-esams and cr2-eqdfw (208.80.153.217) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:50:49] (03CR) 10Btullis: [C:03+2] presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [15:51:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1247: Migration of db1247.eqiad.wmnet completed [15:51:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:52:30] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:52:50] (03CR) 10C. Scott Ananian: [C:03+2] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [15:53:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12055099 (10fgiunchedi) a:05Andrew→03fgiunchedi Taking this on, I'll re-assign as needed once we have a path forward [15:53:30] (03CR) 10C. 
Scott Ananian: [C:04-2] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [15:53:45] (03CR) 10Filippo Giunchedi: [C:03+1] systemd::path: fix empty Unit= in path unit [puppet] - 10https://gerrit.wikimedia.org/r/1305701 (owner: 10Volans) [15:54:10] FIRING: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:54:42] (03CR) 10C. Scott Ananian: "Accidentally clicked C+2 on the wrong patch, whoops." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [15:54:54] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:56:42] (03CR) 10Arlolra: [C:03+1] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. 
Scott Ananian) [15:56:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2245: Migration of db2245.codfw.wmnet completed [15:56:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:57:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:26] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:28] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:28] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:30] RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:57:36] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:58] (03CR) 10FNegri: [C:03+1] utils: skip .mypy_cache in run_ci_locally.sh [puppet] - 10https://gerrit.wikimedia.org/r/1305703 (owner: 10Volans) [15:59:10] RESOLVED: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:59:54] RESOLVED: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - 
group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:01:05] (03PS2) 10Fabfur: cache::haproxy: add correlation id feature [puppet] - 10https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) [16:02:20] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2234.codfw.wmnet with OS trixie [16:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:26] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12055186 (10Jhancock.wm) okay give it another shot. If it does it again I'm gonna open a ticket with Dell. Or we can try just leaving the riser out. I'm gonna leave the ticket open until you get a... 
[16:13:22] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:14:22] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:14:24] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:14:38] (03PS1) 10Cwhite: prometheus: add authentication parameters to es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1305718 (https://phabricator.wikimedia.org/T350516) [16:14:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:22] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:15:37] (03PS1) 10CWilliams: Allow a single replica for sre.mysql.major-upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1305682 (https://phabricator.wikimedia.org/T429758) [16:15:39] jhathaway, rzl , federico3, _joe_ if there's space in this window, I'd like to backport a new version of parsoid to wmf.8 before group2 rolls. [16:16:15] no objection from me, we had no puppet patches [16:16:39] <_joe_> +1 [16:16:49] cscott: can you elaborate on any risk? 
[16:17:03] !log pt1979@cumin2003 START - Cookbook sre.hosts.remove-downtime for cr2-eqdfw,cr2-eqdfw IPv6 [16:17:05] !log pt1979@cumin2003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr2-eqdfw,cr2-eqdfw IPv6 [16:17:26] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:17:50] PROBLEM - SSH on logstash1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:17:50] federico3: latest version of parsoid has passed our internal round-trip testing and fixes some corner case bugs with nested templates and links on templatedata pages which we'd like to have live before parsoid read views is enabled on english wikipedia [16:18:11] (03CR) 10Hnowlan: [C:03+2] redis: remove nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/1305075 (https://phabricator.wikimedia.org/T384924) (owner: 10Hnowlan) [16:18:24] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:18:24] federico3: i'm trying to get this in *before* the group2 roll so that we have at least a little bit of time to bake in group1 to smoke test before turning it on everywhere [16:18:31] FIRING: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:40] RECOVERY - SSH on logstash1023 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:19:01] (that is, the out-of-window backport is an attempt to minimize the risk, since 
the alternative would be a backport in the "usual" window that would immediately go live to all of group 2) [16:19:25] federico3: ^ [16:19:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:20:08] (03PS2) 10Hnowlan: redis: clean up redis nrpe check components [puppet] - 10https://gerrit.wikimedia.org/r/1305077 (https://phabricator.wikimedia.org/T384924) [16:20:22] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [16:20:22] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [16:20:30] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [16:20:42] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1248: Upgrading db1248.eqiad.wmnet [16:20:46] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [16:20:46] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893 [16:21:07] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2246: Upgrading db2246.codfw.wmnet [16:21:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [16:21:40] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2246: Upgrading db2246.codfw.wmnet [16:21:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1248: Upgrading db1248.eqiad.wmnet [16:22:01] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:22:02] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a12 [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) [16:23:10] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1248.eqiad.wmnet with OS trixie [16:23:31] RESOLVED: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:54] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a12 [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) [16:24:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) (owner: 10C. Scott Ananian) [16:24:27] federico3: does that answer your question/concern? [16:24:49] cscott: yess, +1 from me [16:25:06] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2246.codfw.wmnet with OS trixie [16:26:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) (owner: 10C. 
Scott Ananian) [16:27:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) (owner: 10C. Scott Ananian) [16:27:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) (owner: 10C. Scott Ananian) [16:27:37] (03PS1) 10Cwhite: prometheus: remove unused buster branch [puppet] - 10https://gerrit.wikimedia.org/r/1305721 [16:29:43] (03CR) 10Btullis: [C:03+2] presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [16:35:21] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a12 [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) (owner: 10C. Scott Ananian) [16:36:06] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a12 [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) (owner: 10C. 
Scott Ananian) [16:36:37] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1305719|Bump wikimedia/parsoid to 0.24.0-a12 (T353697 T384490 T387374 T387520 T387521 T391624 T393295 T420336 T429624 T429688 T429822)]], [[gerrit:1305720|Bump wikimedia/parsoid to 0.24.0-a12 (T429822)]] [16:37:11] T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697 [16:37:11] T384490: Include directives on a line with headings prevent the legacy parser from generating section edit links - https://phabricator.wikimedia.org/T384490 [16:37:12] T387374: Compound templates prevent section edit links where legacy adds them - https://phabricator.wikimedia.org/T387374 [16:37:12] T387520: Support section edit links to nested templates - https://phabricator.wikimedia.org/T387520 [16:37:13] T387521: Section titles failing to resolve redirected templates - https://phabricator.wikimedia.org/T387521 [16:37:13] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [16:37:13] T393295: Measurement plan + Analysis for the "Get Started" experiment (WE1.2.17, FY24/25) - https://phabricator.wikimedia.org/T393295 [16:37:14] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [16:37:14] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [16:37:15] T429688: mw-empty-elt wrapping does not take DOMFragments into account - https://phabricator.wikimedia.org/T429688 [16:37:15] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [16:38:23] (03CR) 10Andrew Bogott: "Presumably the puppet certs exist in the first place to prevent some kind of mitm attack where a new vindictive puppetserver is injected i" [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [16:38:41] !log cscott@deploy1003 cscott: Backport for [[gerrit:1305719|Bump 
wikimedia/parsoid to 0.24.0-a12 (T353697 T384490 T387374 T387520 T387521 T391624 T393295 T420336 T429624 T429688 T429822)]], [[gerrit:1305720|Bump wikimedia/parsoid to 0.24.0-a12 (T429822)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:39:56] (03CR) 10Hnowlan: [C:03+1] logstash: send thumbor logs to test partition [puppet] - 10https://gerrit.wikimedia.org/r/1305260 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite) [16:40:30] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [16:41:22] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker1160.eqiad.wmnet with OS trixie [16:41:38] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2246.codfw.wmnet with reason: host reimage [16:41:55] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1160 [16:42:01] FIRING: [8x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:42:13] !log cscott@deploy1003 cscott: Continuing with deployment [16:42:58] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [16:45:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [16:46:32] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305719|Bump wikimedia/parsoid to 0.24.0-a12 (T353697 T384490 T387374 T387520 T387521 T391624 T393295 T420336 T429624 T429688 T429822)]], [[gerrit:1305720|Bump wikimedia/parsoid to 0.24.0-a12 (T429822)]] (duration: 09m 54s) [16:46:55] T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697 [16:46:56] T384490: Include directives on a line with headings prevent the legacy parser from generating section edit links - https://phabricator.wikimedia.org/T384490 [16:46:56] T387374: Compound templates prevent section edit links where legacy adds them - https://phabricator.wikimedia.org/T387374 [16:46:57] T387520: Support section edit links to nested templates - https://phabricator.wikimedia.org/T387520 [16:46:58] T387521: Section titles failing to resolve redirected templates - https://phabricator.wikimedia.org/T387521 [16:46:58] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [16:46:58] T393295: Measurement plan + Analysis for the "Get Started" experiment (WE1.2.17, FY24/25) - https://phabricator.wikimedia.org/T393295 [16:46:59] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [16:46:59] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [16:47:00] T429688: mw-empty-elt wrapping does not take DOMFragments into account - 
https://phabricator.wikimedia.org/T429688 [16:47:00] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [16:47:17] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1160 - jasmine@cumin2002" [16:47:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1160 - jasmine@cumin2002" [16:47:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:47:23] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker1160.eqiad.wmnet 116.48.64.10.in-addr.arpa 6.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:47:26] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1160.eqiad.wmnet 116.48.64.10.in-addr.arpa 6.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:47:27] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1160 [16:48:26] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) [16:48:51] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1160 [16:48:51] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1160 [16:49:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. 
Scott Ananian) [16:49:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:50:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2246.codfw.wmnet with reason: host reimage [16:50:20] (03Merged) 10jenkins-bot: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [16:50:32] looking [16:50:40] <_joe_> ulsfo [16:50:46] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1305711|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] [16:50:56] since 10 mins [16:52:51] !log cscott@deploy1003 cscott: Backport for [[gerrit:1305711|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:53:00] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [16:53:00] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [16:53:00] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [16:53:05] !ack [16:53:06] 8097 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet ulsfo) [16:53:08] !incidents [16:53:08] 8097 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet ulsfo) [16:54:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:59:14] !log cscott@deploy1003 cscott: Continuing with deployment [17:00:05] bd808: Your horoscope predicts another Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1700) [17:02:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: 10C. Scott Ananian) [17:02:25] It looks like I can ship some developer-portal updates in today's window. 
[17:02:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1248.eqiad.wmnet with OS trixie [17:04:05] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage [17:07:07] i'm just waiting for the tail end of a config deploy. seems like it's been stuck on 54 of 60 "k8s canaries" for a while [17:07:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2246.codfw.wmnet with OS trixie [17:09:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage [17:12:32] federico3, _joe_, bd808 spiderpig failed: [17:12:41] https://www.irccloud.com/pastebin/v18aFWlz/ [17:13:05] how long was the failure and rollback? [17:13:18] i asked it to retry, but now it's stuck at 0 of 60. [17:13:26] the log says it failed after 10m and that seems about right. [17:14:00] <_joe_> cscott: we're currently looking into a page, let's see if someone else can help you [17:14:01] the running timer on spiderpig says it's been trying to deploy the config change for 25m now, and that includes time it's spent on the retry and time spent in the middle while arlo and i were testing on the testservers, etc [17:14:27] <_joe_> we're dealing with something a bit more urgent atm [17:14:58] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1248: Migration of db1248.eqiad.wmnet completed [17:15:10] (03CR) 10Thcipriani: [C:03+1] "Nice improvement! Readability is better and tests just fine. Nice work." [puppet] - 10https://gerrit.wikimedia.org/r/1302910 (owner: 10Ahmon Dancy) [17:15:42] cscott: taking a look!
not impossible it's related to the page, but let's see [17:15:49] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-api-int/canary on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:16:19] ^ that alert is just identifying the same thing, the rollout is stuck/slow [17:16:36] (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-06-25-122144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305729 [17:19:12] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305730 [17:19:14] <_joe_> rzl: for context, the page is related to a few timeouts happening from time to time, nothing really that would justify a failed deployment [17:19:22] ack, thanks [17:19:46] it made it to 5/60, so it does seem to be progressing, just really really slowly [17:19:53] hm, looks like the canaries just- yeah [17:20:20] it just went backwards? back to 0/60 now. i didn't know that was possible. [17:20:49] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-api-int/canary on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:22:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2246: Migration of db2246.codfw.wmnet completed [17:22:32] the canaries are automatically rolling back, since the deployment timed out -- I don't know offhand where that progress number in scap comes from, but wouldn't shock me if it gets confused by that [17:22:51] I see you have another prompt, let me see what state we're in before you do anything [17:22:55] ok, it gave up and is asking to retry or roll back. 
[17:23:24] yeah i was going to ask what you thought i should do, i don't have high confidence at this point that 'roll back' would be any more successful. [17:23:33] i'll wait for yr advice [17:24:19] one failure (timeout) and successful rollback, then another failure and rollback, so we're successfully back where we started https://www.irccloud.com/pastebin/EO6tf7l9/ [17:25:27] and I see 2026-06-25-130643-webserver-bookworm on both the canaries and the rest, so confirming we're in a consistent state [17:25:29] but it's still on the test servers, right? if i hit "b" for "rollback all stages" it would try to rollback the test servers as well? [17:25:48] ah that's just mw-api-int yeah -- that's correct [17:25:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:26:10] (i have to confess spiderpig is wonderful when it works, but when it fails I never know exactly what state it is going to leave me in) [17:26:41] I see mediawiki-multiversion-debug:2026-06-25-165114-publish-83 at mw-debug yep [17:27:03] (which is probably not *spiderpig* per se but the underlying scap or whatever) [17:27:37] (03PS1) 10Bking: cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) [17:27:49] without knowing anything specifically, 2x timeout roll-forward and successful rollback makes me suspect the problem is genuinely something about the new version. 
still looking though [17:28:08] !ack [17:28:08] 8098 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet ulsfo) [17:28:12] !incidents [17:28:13] 8098 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet ulsfo) [17:28:13] 8097 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet ulsfo) [17:28:35] (03CR) 10CI reject: [V:04-1] cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [17:28:49] I suspect what you need is a relenger more than a serviceop :) but I'll still see what I can find out [17:29:01] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2026-06-25-122144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305729 (owner: 10BryanDavis) [17:29:01] the patch is short but does legit have lots of places for a bug to creep in: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1305711/1/wmf-config/CommonSettings.php [17:29:37] it did test ok on the test servers, though. [17:29:43] yeah [17:29:55] cscott, rzl: the job log shows that helm exited with a non-zero status.
https://spiderpig.wikimedia.org/jobs/2406 [17:29:58] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1160.eqiad.wmnet with OS trixie [17:30:02] (wouldn't necessarily have to be a code bug either, sometimes we've seen things like creeping over a size threshold, that kind of thing) [17:30:18] bd808: correct, that's the "timeout exceeded" we're talking about [17:30:43] under discussion is why it took so long that helm gave up on it :) [17:30:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:31:18] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2026-06-25-122144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305729 (owner: 10BryanDavis) [17:31:49] (03CR) 10Arlolra: [C:03+1] Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: 10C. 
Scott Ananian) [17:32:17] (03PS2) 10Bking: cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) [17:32:54] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:33:02] (03CR) 10CI reject: [V:04-1] cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [17:33:08] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:33:23] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:34:09] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:34:21] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:34:36] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:34:43] (03PS3) 10Bking: cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) [17:35:20] (03CR) 10CI reject: [V:04-1] cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [17:36:53] (03PS4) 10Bking: cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) [17:37:23] rzl: if you don't have any other ideas, i can roll back completely and rewrite the backport in a more-obviously-correct way (just use 'group1' as a target in InitialiseSettings.php).
We wanted to avoid the need for ops to do anything special if they needed to rollback group2, but it seems like maybe it is causing more trouble this way [17:37:33] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker1161.eqiad.wmnet with OS trixie [17:38:05] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1161 [17:38:49] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [17:41:00] cscott: let's go ahead and rollback yeah [17:41:14] !log cscott@deploy1003 Rolling back deployment [17:41:19] I'm still not convinced either way whether the actual change is the problem, but I think we've got as much data as we can :/ [17:42:02] (03CR) 10Bking: [C:03+2] cirrussearch: Improve beta-cluster deploy [puppet] - 10https://gerrit.wikimedia.org/r/1305731 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [17:42:14] my understanding is this will remove the change from production but it's still committed on mediawiki-config git, so i should merge a revert there too to bring us back to a consistent state. that sound right? [17:42:46] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305711|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] (duration: 51m 59s) [17:42:53] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [17:42:54] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [17:42:54] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [17:43:07] (03PS1) 10Arlolra: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' up to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 [17:43:11] cscott: I believe that's correct but not an expert [17:43:33] (03PS1) 10C.
Scott Ananian: Revert "Turn on Parsoid's 'ReturnExperimentalPFragmentTypes'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305734 [17:43:38] cscott: that is correct, thanks [17:43:43] (03CR) 10C. Scott Ananian: [C:03+2] Revert "Turn on Parsoid's 'ReturnExperimentalPFragmentTypes'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305734 (owner: 10C. Scott Ananian) [17:43:55] (03CR) 10Ahmon Dancy: "Agreed. I will work on an update. I'll probably hit you up on IRC with some questions. In the meantime, you can drop the cherry-picked o" [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [17:44:12] (03CR) 10CI reject: [V:04-1] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' up to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (owner: 10Arlolra) [17:44:43] (03Merged) 10jenkins-bot: Revert "Turn on Parsoid's 'ReturnExperimentalPFragmentTypes'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305734 (owner: 10C. Scott Ananian) [17:44:45] (03PS2) 10Arlolra: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' up to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 [17:44:57] jasmine@cumin2002 reimage (PID 2040345) is awaiting input [17:45:36] (03CR) 10CI reject: [V:04-1] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' up to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (owner: 10Arlolra) [17:46:12] rzl: ok, arlo's made a much simpler version in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1305733, it just will require a further config deploy once wmf.8 rolls out to group2. [17:46:52] rzl: do you mind if we try to deploy that? that should help determine whether this is a scap issue or some subtle issue with the way i wrote the original config patch. 
[17:47:12] cscott: hold on for now, I think they need to do some scapping for the ongoing incident first [17:47:26] federico3 is IC and can give you the all-clear when they're ready for you [17:47:36] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1161 - jasmine@cumin2002" [17:47:40] rzl, federico3 got it, thanks! [17:47:41] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1161 - jasmine@cumin2002" [17:47:41] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:47:42] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker1161.eqiad.wmnet 118.48.64.10.in-addr.arpa 8.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:47:46] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1161.eqiad.wmnet 118.48.64.10.in-addr.arpa 8.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:47:47] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1161 [17:47:52] (03CR) 10C.
Scott Ananian: [C:03+1] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' up to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (owner: 10Arlolra) [17:48:37] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [17:50:29] (03CR) 10BryanDavis: "The fix being attempted here is a forever bug that Yuvi introduced when making it possible to have a project local puppetserver ( RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:50:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304706 (https://phabricator.wikimedia.org/T428876) (owner: 10DDesouza) [17:53:03] (03CR) 10BryanDavis: "Hmmm I guess `role::puppetserver::cloud_vps_project` is only on the local puppetserver. The `puppetmaster` hiera check or a net new hiera " [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [17:53:43] (03CR) 10Ahmon Dancy: "My intention is to find some way to inject the desired ca.pem file via hiera." 
[puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [17:56:13] <_joe_> cscott: ping [17:56:34] pong [17:58:32] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1161 [17:58:32] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1161 [18:00:05] brennen and jeena: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1800). [18:00:26] (incident still incidenting) [18:00:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1248: Migration of db1248.eqiad.wmnet completed [18:00:29] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [18:00:39] rzl thanks, will stand by [18:00:47] jeena: federico3 will let you know [18:00:53] 👍 [18:01:58] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12055962 (10VRiley-WMF) Thanks @fgiunchedi Would you be able to do this tomorrow or sometime next week? 
[18:02:05] yup, we are in the middle of a rollback, it'll take a while [18:03:08] !log reedy@deploy1003 Synchronized private/: (no justification provided) (duration: 06m 14s) [18:07:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2246: Migration of db2246.codfw.wmnet completed [18:07:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [18:08:20] cscott, jeena: once the incident is all clear, it's possible rolling forward cscott's change would succeed, but of course now we're into the train window :) I'll let the two of you coordinate on relative pros and cons [18:09:04] We do have backport window right after as well [18:09:31] at this point we'll probably just deploy in the usual "late" backport window after the train. [18:15:15] federico3: okay with you if deployments go ahead? [18:15:30] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1161.eqiad.wmnet with reason: host reimage [18:15:55] (03CR) 10Dzahn: [C:03+2] ci: load mod_ssl in httpd to be able to proxy https [puppet] - 10https://gerrit.wikimedia.org/r/1305531 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:15:58] rzl: the incident is resolved but do you have the deployment window now? 
[18:16:06] if yes, you are good to go [18:16:31] I'm not deploying anything, I just promised the current deployers the IC would give them the go-ahead when it was time :) [18:18:01] Then if there are no objections, I will start the train deploy [18:19:03] I think you're all good [18:20:15] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1161.eqiad.wmnet with reason: host reimage [18:20:53] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:21:06] (03PS1) 10TrainBranchBot: group2 to 1.47.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305736 (https://phabricator.wikimedia.org/T423917) [18:21:09] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305736 (https://phabricator.wikimedia.org/T423917) (owner: 10TrainBranchBot) [18:21:42] (03CR) 10Dzahn: [C:03+2] phabricator: drop diffusion.allow-http-auth config [puppet] - 10https://gerrit.wikimedia.org/r/1305040 (https://phabricator.wikimedia.org/T418045) (owner: 10Aklapper) [18:22:08] (03Merged) 10jenkins-bot: group2 to 1.47.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305736 (https://phabricator.wikimedia.org/T423917) (owner: 10TrainBranchBot) [18:23:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:18] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host zuul1004 [18:26:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host zuul1004 [18:26:35] (03CR) 10Dzahn: [C:03+2] "apache2ctl -M | grep ssl" [puppet] - 10https://gerrit.wikimedia.org/r/1305531 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:28:26] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.8 refs T423917 [18:28:31] T423917: 1.47.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T423917 [18:33:08] 
cscott: you can use the rest of the train window if you want to [18:34:33] I should probably get a patch deployed if it's free... [18:34:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Transport: Arelion (IC-398709) {#20260602}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:34:55] jeena, rzl, federico3 does it make sense to try to deploy the config patch from before, to see if the deploy issues were resolved with the 🔥? [18:35:10] Reedy: go ahead, i'm just going to cause trouble :) [18:36:07] sounds worth trying to me! best case, it succeeds; worst case, it times out and we roll back again, having learned something [18:36:26] (03PS2) 10Reedy: InitialiseSettings: Require 2FA for all on arbcom_*wiki and conductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305501 (https://phabricator.wikimedia.org/T428103) [18:36:30] (03CR) 10Reedy: [C:03+2] InitialiseSettings: Require 2FA for all on arbcom_*wiki and conductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305501 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [18:36:36] Reedy: happy to let you go first, though [18:37:19] (where in "we" I'm really only counting myself as moral support) [18:37:24] Reedy will demonstrate that ordinary patches to mediawiki-config do/do not have problems :) [18:37:32] heh [18:37:42] (03Merged) 10jenkins-bot: InitialiseSettings: Require 2FA for all on arbcom_*wiki and conductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305501 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [18:38:14] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1305501|InitialiseSettings: Require 2FA for all on arbcom_*wiki and conductwiki (T428103)]] [18:38:19] T428103: 
Enforce 2FA for all users on private wikis in WMF production - https://phabricator.wikimedia.org/T428103 [18:39:29] (03PS5) 10Ahmon Dancy: modules/beta/files/wmf-beta-update-databases.py: Keep update.php jobs topped up [puppet] - 10https://gerrit.wikimedia.org/r/1302910 [18:39:46] Reedy: while you're waiting for jenkins/zuul/scap, want to look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1305711/1/wmf-config/CommonSettings.php briefly and see if you see anything obviously broken there? When we deployed it earlier it made it to testservers and arlo and i successfully tested it there, then the k8s-canaries only got to 55/60 before getting stuck, and we eventually rolled back. [18:40:18] !log reedy@deploy1003 reedy: Backport for [[gerrit:1305501|InitialiseSettings: Require 2FA for all on arbcom_*wiki and conductwiki (T428103)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:40:26] Was there any error? [18:40:56] Doing some potential extra/early autoloading could be a source of issues [18:41:06] depending on how heavy those classes are [18:41:11] !log reedy@deploy1003 reedy: Continuing with deployment [18:41:21] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host zuul1004 [18:41:40] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1161.eqiad.wmnet with OS trixie [18:43:07] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host zuul1004 [18:45:05] Reedy: hm, autoloading is a good thought. the only thing being invoked is composer machinery, but maybe that's not safe yet.
[18:45:05] https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/src/Parsoid.php#L75 [18:45:27] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305501|InitialiseSettings: Require 2FA for all on arbcom_*wiki and conductwiki (T428103)]] (duration: 07m 12s) [18:45:31] Reedy: no error, and it worked on test servers. [18:45:31] T428103: Enforce 2FA for all users on private wikis in WMF production - https://phabricator.wikimedia.org/T428103 [18:46:57] (03PS2) 10RLazarus: coredns: Parameterize `name` and `k8s-app` [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305545 (https://phabricator.wikimedia.org/T427864) [18:46:57] (03PS2) 10RLazarus: coredns: Add an internal_only value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305546 (https://phabricator.wikimedia.org/T427864) [18:46:57] (03PS2) 10RLazarus: admin_ng: Install coredns-internalonly in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305547 (https://phabricator.wikimedia.org/T427864) [18:48:16] Reedy: you done? [18:48:16] cscott: I guess now jeena is done... unless there's a rollback ;) [18:48:19] Yeah [18:49:28] yes i am done as well [18:50:01] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:50:39] ok, i'm going to be brave and/or foolish and try the same config patch again? we'll learn something and hopefully I don't walk away with a t-shirt. [18:50:56] :) [18:51:57] My comment was more you don't need that scaffolding around the config if .8 is stable :P [18:52:29] (03PS1) 10C. 
Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305740 (https://phabricator.wikimedia.org/T429624) [18:53:01] (03PS6) 10Ahmon Dancy: modules/beta/files/wmf-beta-update-databases.py: Keep update.php jobs topped up [puppet] - 10https://gerrit.wikimedia.org/r/1302910 [18:53:03] Reedy, jeena: we were trying to avoid extra steps in case wmf.8 needs to be rolled back at some point [18:53:28] could just vary on $wgVersion [18:53:39] depending on what ends up rolled back, I guess [18:54:01] (yeah, there's also the chance that parsoid gets rolled back) [18:55:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:41] (03PS3) 10C. Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (owner: 10Arlolra) [18:56:04] vriley@cumin1003 netbox (PID 3168879) is awaiting input [18:56:07] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1305733 is the simpler version, we could just go with that and a note to the train conductors [18:57:32] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker1162.eqiad.wmnet with OS trixie [18:57:47] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [zuul1004] - vriley@cumin1003" [18:57:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [zuul1004] - vriley@cumin1003" [18:57:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:57:54] !log jasmine@cumin2002 START - Cookbook 
sre.hosts.move-vlan for host wikikube-worker1162 [18:58:46] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:59:33] (03PS4) 10C. Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (owner: 10Arlolra) [19:00:24] ok, new plan, deploy simpler patch, try not to break anything [19:00:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (owner: 10Arlolra) [19:00:52] (03CR) 10Arlolra: [C:03+1] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (owner: 10Arlolra) [19:01:31] (03Abandoned) 10C. Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305740 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [19:01:57] (03PS5) 10C. 
Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (https://phabricator.wikimedia.org/T429624) (owner: 10Arlolra) [19:03:02] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [zuul1004] - vriley@cumin1003" [19:03:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [zuul1004] - vriley@cumin1003" [19:03:07] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:03:07] (03PS6) 10C. Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (https://phabricator.wikimedia.org/T429624) (owner: 10Arlolra) [19:03:27] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [19:04:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (https://phabricator.wikimedia.org/T429624) (owner: 10Arlolra) [19:05:19] (03Merged) 10jenkins-bot: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305733 (https://phabricator.wikimedia.org/T429624) (owner: 10Arlolra) [19:05:36] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1305733|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] [19:05:36] (03PS1) 10Eric Gardner: Roll back mobile image carousel from sitewide to beta opt-in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305741 (https://phabricator.wikimedia.org/T429414) [19:05:43] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [19:05:44] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [19:05:44] 
T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [19:05:56] (03CR) 10Ahmon Dancy: [V:03+1] "Beta-only change, live in beta. Works great." [puppet] - 10https://gerrit.wikimedia.org/r/1302910 (owner: 10Ahmon Dancy) [19:06:26] (03CR) 10Eric Gardner: "See also https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MultimediaViewer/+/1305258" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305741 (https://phabricator.wikimedia.org/T429414) (owner: 10Eric Gardner) [19:06:29] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:06:30] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker1162.eqiad.wmnet 108.48.64.10.in-addr.arpa 8.0.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:06:33] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1162.eqiad.wmnet 108.48.64.10.in-addr.arpa 8.0.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:06:34] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1162 [19:07:03] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:07:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2002 [19:07:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2002 [19:07:10] T426862: Ensure cookbooks work with OpenSearch 2.x - https://phabricator.wikimedia.org/T426862 [19:07:28] !log cscott@deploy1003 cscott, arlolra: Backport for [[gerrit:1305733|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [19:07:38] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2002 [19:07:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2002 [19:08:13] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1162 [19:08:13] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1162 [19:08:14] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2002 [19:08:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2002 [19:08:46] !log bking@cumin2003 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2003 [19:08:47] !log bking@cumin2003 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2003 [19:10:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:12] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host zuul1004 [19:11:00] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1231 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 
https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [19:11:02] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1231 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T430219 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [19:11:11] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1231 - https://phabricator.wikimedia.org/T430219 (10ops-monitoring-bot) 03NEW [19:11:24] !log cscott@deploy1003 cscott, arlolra: Continuing with deployment [19:11:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host zuul1004 [19:12:15] 👀 [19:12:30] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host zuul1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:12:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2112:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2112 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:12:51] rzl: this is the simpler patch, though. i wasn't brave enough to try the original again.
[19:12:55] nod [19:13:24] so now we're just waiting to see if scap has a grudge against me personally [19:13:51] (or i guess, with the underlying parsoid feature this turns on) [19:14:28] scap: stymieing cscott and parsoid [19:14:55] (03PS1) 101Veertje: Fix login form: use label elements instead of th for accessibility [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1305742 [19:15:02] it seems to have paused at 49/60 again, so i'm not feeling good [19:15:13] (03CR) 10Dzahn: "I really have no involvement in this topic which makes me uncomfortable being the one asked to merge these things." [puppet] - 10https://gerrit.wikimedia.org/r/1302910 (owner: 10Ahmon Dancy) [19:16:34] I'd be curious if there's anything in the logs for the deployment on the k8s side? Like if it's having trouble with health checks or image pulling or what. [19:17:11] the timeout message from scap doesn't say much, obviously :) [19:17:21] vriley@cumin1003 provision (PID 3198192) is awaiting input [19:17:22] (03CR) 10Ahmon Dancy: [V:03+1] "Sorry Daniel. I'll figure out some other way to handle these requests." [puppet] - 10https://gerrit.wikimedia.org/r/1302910 (owner: 10Ahmon Dancy) [19:17:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2112:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2112 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:17:52] mw-web.eqiad.canary-799dffbbb6-7bcj7 appears stuck, digging a little [19:18:00] Warning NodeNotReady 8m35s node-controller Node is not ready [19:18:13] (03CR) 10WMDE-Fisch: [C:03+1] Remove wgCiteRemoveSyntheticRefsUnsafe feature flag from production and beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305639 (https://phabricator.wikimedia.org/T428232) (owner: 10Mareike Heuer) [19:18:14] yeah, anything that could help diagnose this. 
we've run 184,867 pages through this configuration on our rt testing servers, and i manually tested pages while this was on the test servers, so whatever's going on isn't an "obviously catches fire and dies" issue [19:18:25] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host zuul1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:18:30] Status: Terminating (lasts 5m33s) [19:18:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305639 (https://phabricator.wikimedia.org/T428232) (owner: 10Mareike Heuer) [19:19:49] 10ops-codfw, 06DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2315:9290 - https://phabricator.wikimedia.org/T430220 (10phaultfinder) 03NEW [19:20:55] actually hang on. this is on wikikube-worker1162 which jasmine_ is reimaging right now [19:21:03] https://logstash.wikimedia.org/app/dashboards#/view/d43f9bf0-17b5-11eb-b848-090a7444f26c?_g=h@0b3e9ae&_a=h@3d9768c [19:21:06] perhaps? [19:21:08] I wonder if there's something wrong with the reimage process, there shouldn't still be any work there [19:21:26] oh sorry yes I'm re imaging 1162 at current [19:21:35] jasmine_: there's some sort of drain and cordon step in the process, right? I haven't looked closely yet [19:21:52] but something there isn't right, that's what's getting us here [19:21:58] the timing lines up with the first deploy too [19:22:18] https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@adfdeeb&_a=h@e82d143 might be related? "could not acquire page lock"? 
[19:22:44] (but why does it work when Reedy does it) [19:23:09] broken share link :P [19:23:30] if my mental model is right here, you'd have to be shutting down a container on the node as it's being reimaged, Reedy is just luckier than you [19:23:36] rzl: i believe so, it's currently generating a new puppet certificate, is it perhaps that I ran the re-image during the deploy window? [19:23:54] i was *joking* when i said I was just waiting to see if scap had a grudge against me personally [19:24:04] I'm not sure I would say I'm lucky with many other things in my life [19:24:17] jasmine_: during the deploy window ought to be fine, but the reason it's fine is that we ought to start by draining the node and moving its workload to other workers [19:24:36] it looks like that didn't happen here, which is why the reimage is affecting live work, which is bad [19:24:40] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1162.eqiad.wmnet with reason: host reimage [19:25:09] (*or* we did that and it didn't work, because the container failed to shut down for whatever reason? 
feels less likely but not impossible) [19:25:32] still looking, let's sort that out before you start the next host [19:26:07] Reedy: it's the "could not acquire lock for page ID" item from https://spiderpig.wikimedia.org/mediawiki/logs [19:26:17] cscott: go ahead and roll this one back, sorry, just so that it's not sitting there -- but now that we think we have a root cause, we should be able to unblock you [19:26:30] !log cscott@deploy1003 Rolling back deployment [19:27:28] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool es2045: rack depool [19:27:43] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305733|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] (duration: 22m 07s) [19:27:50] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [19:27:51] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [19:27:51] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [19:28:07] (03PS1) 10C. Scott Ananian: Revert "Turn on Parsoid's 'ReturnExperimentalPFragmentTypes'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305744 [19:28:11] rzl: ah right, looking at cookbook outputs it doesn't appear to drain - let me figure out why that didn't happen [19:28:12] (03CR) 10C. Scott Ananian: [C:03+2] Revert "Turn on Parsoid's 'ReturnExperimentalPFragmentTypes'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305744 (owner: 10C. Scott Ananian) [19:28:19] cscott: sorry, my bad! 
[19:29:04] (03PS2) 101Veertje: Fix login form: use label elements instead of th for accessibility [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1305742 [19:29:08] no worries, i'd really rather the problem not be some mysterious bug in my code :) [19:29:41] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1162.eqiad.wmnet with reason: host reimage [19:30:16] o right https://logstash.wikimedia.org/app/dashboards#/view/d43f9bf0-17b5-11eb-b848-090a7444f26c?_g=h@0b3e9ae&_a=h@4c7e19d [19:30:17] (03Merged) 10jenkins-bot: Revert "Turn on Parsoid's 'ReturnExperimentalPFragmentTypes'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305744 (owner: 10C. Scott Ananian) [19:31:48] (03PS1) 101Veertje: Fix login form: responsive labels and mobile-friendly layout [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1305745 [19:38:24] rzl, jasmine_: ok, the config change is rolled back. let me know when you think it would be ok to retry? [19:41:43] cscott: give it a try now, 1162 is empty and jasmine_ is going to make sure the next one isn't impactful before moving on [19:42:19] sorry, I should have thought enough to say "roll back in scap but don't revert the patch, it won't be long" [19:43:49] (03PS1) 10C. Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305746 (https://phabricator.wikimedia.org/T429624) [19:44:02] ok, here goes. [19:44:10] i'm not going to say anything bad about scap this time [19:45:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305746 (https://phabricator.wikimedia.org/T429624) (owner: 10C. 
Scott Ananian) [19:49:09] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1162.eqiad.wmnet with OS trixie [19:49:48] (03Merged) 10jenkins-bot: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305746 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [19:50:27] (03PS2) 10Eric Gardner: Roll back mobile image carousel from sitewide to beta opt-in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305741 (https://phabricator.wikimedia.org/T429414) [19:52:01] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1305746|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (take 2) (T429624 T429822 T391624)]] [19:52:08] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [19:52:09] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [19:52:11] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [19:53:30] !log cscott@deploy1003 cscott: Backport for [[gerrit:1305746|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (take 2) (T429624 T429822 T391624)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:54:41] !log cscott@deploy1003 cscott: Continuing with deployment [19:54:50] i can't watch [19:55:37] ohoho [19:56:28] (03PS1) 10BPirkle: REST: remove obsolete and unnecessary config entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) [19:57:17] (03CR) 10BPirkle: "Prerequisite core change is on all wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) (owner: 10BPirkle) [19:58:48] (03PS2) 10BPirkle: REST: remove obsolete and unnecessary config entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) [19:58:59] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305746|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (take 2) (T429624 T429822 T391624)]] (duration: 06m 58s) [19:59:04] success! [19:59:05] \o/ [19:59:07] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [19:59:07] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [19:59:08] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [19:59:12] rzl: thanks for all of your help [19:59:13] thanks for your patience :) [19:59:30] i'm just glad that in the end it wasn't my bug [19:59:33] i have enough t-shirts already [20:00:04] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T2000). [20:00:04] abijeet, arlolra, cscott, danisztls, and WMDE-Fisch: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] https://usercontent.irccloud-cdn.com/file/MsiodWag/image.png [20:00:23] \o [20:00:36] o/ [20:00:36] My patch is pure cleanup, does not need testing and can be paired with anything else in the config repo ;-) [20:01:12] Mine is just undeploying a survey and returning to previous state [20:01:14] (03PS1) 10Jdlrobson: Restore carousel beta opt-in, suppressed when enabled sitewide [extensions/MultimediaViewer] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305749 (https://phabricator.wikimedia.org/T429414) [20:01:24] I can fill in for abijeet [20:01:50] (03CR) 10Eric Gardner: [C:03+1] Restore carousel beta opt-in, suppressed when enabled sitewide [extensions/MultimediaViewer] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305749 (https://phabricator.wikimedia.org/T429414) (owner: 10Jdlrobson) [20:02:23] does anyone need a deployer [20:02:45] I do [20:03:35] o/ [20:03:46] I also would appreciate mine to be taken care of :-) [20:03:47] o/ [20:03:54] okay [20:04:07] And as mentioned mine is fine to be paired with any other [20:04:23] I can deploy mine and combine others [20:04:49] okay arlolra do you want to go ahead? [20:05:16] sure [20:05:23] Nikerabbit: should I combine yours?
[20:05:34] arlolra: feel free [20:07:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304888 (https://phabricator.wikimedia.org/T429830) (owner: 10Arlolra) [20:07:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305290 (owner: 10Abijeet Patro) [20:07:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305639 (https://phabricator.wikimedia.org/T428232) (owner: 10Mareike Heuer) [20:08:32] (03Merged) 10jenkins-bot: Deploy PRV to 5 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304888 (https://phabricator.wikimedia.org/T429830) (owner: 10Arlolra) [20:08:35] (03Merged) 10jenkins-bot: Enable ULS v2 by default across all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305290 (owner: 10Abijeet Patro) [20:08:38] (03Merged) 10jenkins-bot: Remove wgCiteRemoveSyntheticRefsUnsafe feature flag from production and beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305639 (https://phabricator.wikimedia.org/T428232) (owner: 10Mareike Heuer) [20:08:54] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1304888|Deploy PRV to 5 wikis (T429830)]], [[gerrit:1305290|Enable ULS v2 by default across all wikis]], [[gerrit:1305639|Remove wgCiteRemoveSyntheticRefsUnsafe feature flag from production and beta cluster config (T428232)]] [20:09:02] T429830: Parsoid Read Views to deploy ~2026-06-25 - https://phabricator.wikimedia.org/T429830 [20:09:02] T428232: [Cleanup] Remove code that creates or depends on synthetic main refs - https://phabricator.wikimedia.org/T428232 [20:10:49] !log arlolra@deploy1003 arlolra, mareikeheuer, abi: Backport for [[gerrit:1304888|Deploy PRV to 5 wikis (T429830)]], [[gerrit:1305290|Enable ULS v2 by default across 
all wikis]], [[gerrit:1305639|Remove wgCiteRemoveSyntheticRefsUnsafe feature flag from production and beta cluster config (T428232)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:11:27] Nikerabbit: if there's anything to test [20:12:29] arlolra: yep, I see the new version on test servers so it's good [20:12:35] thanks [20:12:41] !log arlolra@deploy1003 arlolra, mareikeheuer, abi: Continuing with deployment [20:12:55] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2045: rack depool [20:15:48] (03CR) 10JHathaway: [C:03+2] utils: skip .mypy_cache in run_ci_locally.sh [puppet] - 10https://gerrit.wikimedia.org/r/1305703 (owner: 10Volans) [20:16:20] (03PS2) 10Volans: utils: skip .mypy_cache in run_ci_locally.sh [puppet] - 10https://gerrit.wikimedia.org/r/1305703 [20:16:32] (03CR) 10JHathaway: [C:03+2] utils: skip .mypy_cache in run_ci_locally.sh [puppet] - 10https://gerrit.wikimedia.org/r/1305703 (owner: 10Volans) [20:16:58] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1304888|Deploy PRV to 5 wikis (T429830)]], [[gerrit:1305290|Enable ULS v2 by default across all wikis]], [[gerrit:1305639|Remove wgCiteRemoveSyntheticRefsUnsafe feature flag from production and beta cluster config (T428232)]] (duration: 08m 04s) [20:17:04] T429830: Parsoid Read Views to deploy ~2026-06-25 - https://phabricator.wikimedia.org/T429830 [20:17:05] T428232: [Cleanup] Remove code that creates or depends on synthetic main refs - https://phabricator.wikimedia.org/T428232 [20:17:18] ty! [20:17:29] I think cscott has done enough deploying for today. 
danisztls you're up [20:18:23] arlolra: ok, thanks [20:18:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304706 (https://phabricator.wikimedia.org/T428876) (owner: 10DDesouza) [20:24:33] rzl: Would you be willing to +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1302910 ? [20:34:38] dancy: looking [20:36:12] sure, not reviewing the python at all but since it's beta only and Tyler already +1ed, happy to merge [20:36:23] (03CR) 10RLazarus: [C:03+2] modules/beta/files/wmf-beta-update-databases.py: Keep update.php jobs topped up [puppet] - 10https://gerrit.wikimedia.org/r/1302910 (owner: 10Ahmon Dancy) [20:36:23] Thanks! That's all I need. [20:36:47] assuming you don't need a manual puppet run anywhere, etc [20:37:07] "I'm assuming", that is -- not "as long as" :) [20:37:33] Nope [20:37:42] 👍 [20:37:54] puppet-merge done, you're all set [20:38:03] Thanks again! [20:38:38] (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304706 (https://phabricator.wikimedia.org/T428876) (owner: 10DDesouza) [20:42:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:42:44] (03PS3) 10BPirkle: REST: remove obsolete and unnecessary config entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) [20:46:00] gate-and-submit step appears to be locked. No progress in 20m. I'm aborting the backport to recheck and try again.
[20:47:17] (03PS4) 10BPirkle: REST: remove obsolete and unnecessary config entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) [20:51:50] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host zuul1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:55:01] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host zuul1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:55:23] (03PS2) 10DDesouza: Undeploy English Wikipedia Mobile App Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304706 (https://phabricator.wikimedia.org/T428876) [20:55:42] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [20:55:51] (03CR) 10TrainBranchBot: "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304706 (https://phabricator.wikimedia.org/T428876) (owner: 10DDesouza) [20:58:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:59:09] danisztls: we are restarting jenkins to hopefully unblock the jobs [20:59:28] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host zuul1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T2100) [21:00:21] jeena: it appears to have worked [21:02:34] vriley@cumin1003 provision (PID 3338075) is awaiting input [21:02:36] o/ let me know when i can take over the window [21:03:07] jeena: at least it is stuck at 100% now [21:03:21] 🫠 [21:03:52] (03Merged) 10jenkins-bot: Undeploy English Wikipedia Mobile App Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304706 (https://phabricator.wikimedia.org/T428876) (owner: 10DDesouza) [21:04:06] !log dani@deploy1003 Started scap 
sync-world: Backport for [[gerrit:1304706|Undeploy English Wikipedia Mobile App Survey (T428876)]] [21:04:11] T428876: Quick survey on Wikipedia - Mobile App Survey (WP25) - https://phabricator.wikimedia.org/T428876 [21:04:18] it merged [21:06:03] !log dani@deploy1003 dani: Backport for [[gerrit:1304706|Undeploy English Wikipedia Mobile App Survey (T428876)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:06:44] !log dani@deploy1003 dani: Continuing with deployment [21:09:05] danisztls: is this the last backport of the window? [21:10:32] Jdlrobson: I think so. Unless cscott is deploying. [21:10:56] Great. cscott we have made a promise to multiple communities to roll out a change so I will need to use this backport window as soon as this backport is done. [21:11:01] !log dani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1304706|Undeploy English Wikipedia Mobile App Survey (T428876)]] (duration: 06m 54s) [21:11:06] T428876: Quick survey on Wikipedia - Mobile App Survey (WP25) - https://phabricator.wikimedia.org/T428876 [21:11:36] Jdlrobson: mine is finished [21:11:47] TY [21:12:21] (03PS1) 10Dzahn: integration: include all proxy configs, including new for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1305759 (https://phabricator.wikimedia.org/T418521) [21:12:31] (03PS1) 10Bking: cirrussearch: fix opensearch environment var name [puppet] - 10https://gerrit.wikimedia.org/r/1305760 (https://phabricator.wikimedia.org/T425585) [21:13:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305749 (https://phabricator.wikimedia.org/T429414) (owner: 10Jdlrobson) [21:13:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305741 (https://phabricator.wikimedia.org/T429414) 
(owner: 10Eric Gardner) [21:13:52] Jdlrobson once changes are on test servers I can double check [21:14:43] (03Merged) 10jenkins-bot: Roll back mobile image carousel from sitewide to beta opt-in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305741 (https://phabricator.wikimedia.org/T429414) (owner: 10Eric Gardner) [21:14:55] (03Merged) 10jenkins-bot: Restore carousel beta opt-in, suppressed when enabled sitewide [extensions/MultimediaViewer] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305749 (https://phabricator.wikimedia.org/T429414) (owner: 10Jdlrobson) [21:15:12] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1305749|Restore carousel beta opt-in, suppressed when enabled sitewide (T429414)]], [[gerrit:1305741|Roll back mobile image carousel from sitewide to beta opt-in (T429414)]] [21:15:17] T429414: [Image Browsing] Launch image carousel across wikis - https://phabricator.wikimedia.org/T429414 [21:15:49] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host zuul1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:16:36] (03CR) 10Dzahn: [C:03+2] integration: include all proxy configs, including new for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1305759 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:17:12] (03CR) 10Cwhite: [C:03+2] prometheus: remove unused buster branch [puppet] - 10https://gerrit.wikimedia.org/r/1305721 (owner: 10Cwhite) [21:17:15] (03CR) 10Bking: [C:03+2] cirrussearch: fix opensearch environment var name [puppet] - 10https://gerrit.wikimedia.org/r/1305760 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [21:20:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[21:21:50] (03Abandoned) 10C. Scott Ananian: Revert "Fix format=json option for OAuth 1" [extensions/OAuth] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305512 (owner: 10C. Scott Ananian) [21:24:29] (03PS1) 10Ryan Kemper: Fix rolling-operation datetimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1305766 (https://phabricator.wikimedia.org/T426862) [21:24:45] (03PS2) 10Ryan Kemper: Fix rolling-operation datetimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1305766 (https://phabricator.wikimedia.org/T426862) [21:25:45] (03PS3) 10Ryan Kemper: Fix rolling-operation datetimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1305766 (https://phabricator.wikimedia.org/T426862) [21:26:58] (03PS1) 10Dzahn: integration: turn SSLProxyEngine on [puppet] - 10https://gerrit.wikimedia.org/r/1305767 (https://phabricator.wikimedia.org/T418521) [21:27:49] (03CR) 10Dzahn: [C:03+2] integration: turn SSLProxyEngine on [puppet] - 10https://gerrit.wikimedia.org/r/1305767 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:29:57] (03PS4) 10Ryan Kemper: Fix rolling-operation datetimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1305766 (https://phabricator.wikimedia.org/T426862) [21:32:05] !log bking@cumin2003 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2003 [21:32:10] T426862: Ensure cookbooks work with OpenSearch 2.x - https://phabricator.wikimedia.org/T426862 [21:33:08] !log jdlrobson@deploy1003 egardner, jdlrobson: Backport for [[gerrit:1305749|Restore carousel beta opt-in, suppressed when enabled sitewide (T429414)]], [[gerrit:1305741|Roll back mobile image carousel from sitewide to beta opt-in (T429414)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:33:12] T429414: [Image Browsing] Launch image carousel across wikis - https://phabricator.wikimedia.org/T429414 [21:33:36] EricGardner: you can test this now [21:33:47] Ok, checking now [21:35:16] EricGardner: LGTM and Szymon. Let me know if you have any objections to proceeding [21:35:28] LGTM too [21:35:33] !log jdlrobson@deploy1003 egardner, jdlrobson: Continuing with deployment [21:35:41] beta feature (and opting out of beta) works as expected [21:37:10] (03PS2) 10Cwhite: prometheus: add authentication parameters to es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1305718 (https://phabricator.wikimedia.org/T350516) [21:43:03] (03PS1) 10Cwhite: logstash: add parameters needed for security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1305769 (https://phabricator.wikimedia.org/T350516) [21:43:45] (03PS2) 10Cwhite: logstash: add parameters needed for security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1305769 (https://phabricator.wikimedia.org/T350516) [21:44:44] (03PS3) 10Cwhite: logstash: add parameters needed for security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1305769 (https://phabricator.wikimedia.org/T350516) [21:47:41] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305749|Restore carousel beta opt-in, suppressed when enabled sitewide (T429414)]], [[gerrit:1305741|Roll back mobile image carousel from sitewide to beta opt-in (T429414)]] (duration: 32m 28s) [21:47:45] T429414: [Image Browsing] Launch image carousel across wikis - https://phabricator.wikimedia.org/T429414 [21:54:50] (03CR) 10Bking: [C:03+1] Fix rolling-operation datetimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1305766 (https://phabricator.wikimedia.org/T426862) (owner: 10Ryan Kemper) [21:55:03] (03PS1) 10Ahmon Dancy: beta::autoupdater: Add tail-beta-update-logs utility [puppet] - 10https://gerrit.wikimedia.org/r/1305770 (https://phabricator.wikimedia.org/T256168) [21:55:26] (03CR) 10Bking: [C:03+1] "Confirmed that this 
change works via `test-cookbook`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1305766 (https://phabricator.wikimedia.org/T426862) (owner: 10Ryan Kemper) [21:56:51] !log bking@cumin2003 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426862 - bking@cumin2003 [21:56:56] T426862: Ensure cookbooks work with OpenSearch 2.x - https://phabricator.wikimedia.org/T426862 [21:58:51] (03PS6) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [22:00:20] (03PS7) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [22:02:21] (03PS8) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [22:02:37] (03PS9) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server change [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [22:05:19] (03PS1) 10Dzahn: jenkins: enable jenkins service [puppet] - 10https://gerrit.wikimedia.org/r/1305772 (https://phabricator.wikimedia.org/T418521) [22:07:56] (03CR) 10Dzahn: [C:03+2] jenkins: enable jenkins service [puppet] - 10https://gerrit.wikimedia.org/r/1305772 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [22:08:12] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.4.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305289 (https://phabricator.wikimedia.org/T428984) (owner: 10Clare Ming) [22:10:39] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.4.5 release to production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1305289 (https://phabricator.wikimedia.org/T428984) (owner: 10Clare Ming) [22:11:04] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [22:14:26] (03PS1) 10Jdrewniak: Phase 3 Legal contact link deployments. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305773 (https://phabricator.wikimedia.org/T430227) [22:15:49] (03CR) 10CI reject: [V:04-1] Phase 3 Legal contact link deployments. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305773 (https://phabricator.wikimedia.org/T430227) (owner: 10Jdrewniak) [22:20:37] (03CR) 10Ahmon Dancy: "Andrew, I've made revisions. The `profile::puppet::agent::puppetserver_ca_cert` hiera key must be set for this functionality to be enable" [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [22:21:17] (03CR) 10Milazg: [C:03+1] REST: remove obsolete and unnecessary config entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) (owner: 10BPirkle) [22:26:27] (03PS1) 10Cwhite: opensearch_dashboards: add settings required for security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1305774 (https://phabricator.wikimedia.org/T350516) [22:34:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Transport: Arelion (IC-398709) {#20260602}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:03:53] (03PS1) 10Ryan Kemper: query_service: correct lindas, drop dead endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1305778 (https://phabricator.wikimedia.org/T429791) [23:03:56] (03PS1) 10Ryan 
Kemper: query_service: allowlist Beyond Notability [puppet] - 10https://gerrit.wikimedia.org/r/1305779 (https://phabricator.wikimedia.org/T429810) [23:03:59] (03PS1) 10Ryan Kemper: query_service: allowlist QLever Commons [puppet] - 10https://gerrit.wikimedia.org/r/1305780 (https://phabricator.wikimedia.org/T429807) [23:14:28] (03PS1) 10Dzahn: Revert "jenkins: enable jenkins service" [puppet] - 10https://gerrit.wikimedia.org/r/1305782 [23:17:15] (03CR) 10Ryan Kemper: [C:03+2] query_service: correct lindas, drop dead endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1305778 (https://phabricator.wikimedia.org/T429791) (owner: 10Ryan Kemper) [23:17:18] (03CR) 10Ryan Kemper: [C:03+2] query_service: allowlist Beyond Notability [puppet] - 10https://gerrit.wikimedia.org/r/1305779 (https://phabricator.wikimedia.org/T429810) (owner: 10Ryan Kemper) [23:17:23] (03CR) 10Ryan Kemper: [C:03+2] query_service: allowlist QLever Commons [puppet] - 10https://gerrit.wikimedia.org/r/1305780 (https://phabricator.wikimedia.org/T429807) (owner: 10Ryan Kemper) [23:18:37] (03CR) 10Dzahn: [C:03+2] Revert "jenkins: enable jenkins service" [puppet] - 10https://gerrit.wikimedia.org/r/1305782 (owner: 10Dzahn) [23:29:07] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#12056903 (10Dwisehaupt) @Papaul @Jhancock.wm Sorry to pull up something on some old machines, but it looks like these two hosts are not on the network after moving to the new rack. I... 
[23:30:04] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [23:31:24] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [23:31:45] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [23:33:23] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frmx1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429529#12056919 (10Dwisehaupt) [23:33:25] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12056921 (10Dwisehaupt) [23:33:48] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12056924 (10Dwisehaupt) [23:34:04] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frmx1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429529#12056925 (10Dwisehaupt) [23:41:52] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [23:42:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1305784 [23:42:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1305784 (owner: 10TrainBranchBot) [23:47:38] (03PS1) 10Santiago Faci: growthbook: Updated chart to add API_RATE_LIMIT_MAX env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305785 (https://phabricator.wikimedia.org/T429420) [23:50:14] (03Abandoned) 10Santiago Faci: test-kitchen-next: Set `ui_url` explicitly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304092 (owner: 10Santiago Faci) [23:51:13] (03Abandoned) 10Santiago Faci: Test Kitchen UI: Restore log 
level to default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299573 (owner: 10Santiago Faci) [23:53:15] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1305784 (owner: 10TrainBranchBot) [23:56:18] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)