[00:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:27:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:32:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:37:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:51:54] PROBLEM - MariaDB Replica Lag: pc1 on pc2021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[01:11:59] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306137
[01:11:59]
(03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306137 (owner: 10TrainBranchBot) [01:19:50] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306137 (owner: 10TrainBranchBot) [01:50:13] (03PS1) 10Gergő Tisza: [WIP] Remove security-related log hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 [01:51:30] (03CR) 10CI reject: [V:04-1] [WIP] Remove security-related log hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 (owner: 10Gergő Tisza) [01:54:05] (03PS2) 10Gergő Tisza: Remove security-related log hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 [02:00:23] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:29] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 07m 05s) [02:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [02:58:52] (03CR) 10LuniZunie: "Fuck this shit dude. I'm gonna go shoot up a school. 
On the english wikipedia you may find me at User:LuniZunie" [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306137 (owner: 10TrainBranchBot) [03:03:30] (03CR) 10LuniZunie: "> Fuck this shit dude. I'm gonna go shoot up a school. On the english wikipedia you may find me at User:LuniZunie" [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306137 (owner: 10TrainBranchBot) [03:07:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:37:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:55:17] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Switch to float16 and reduce context length for Qwen3.6-27B deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305919 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [05:20:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2021: pc1 hw issues 
[05:20:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:20:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:20:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2021: pc1 hw issues
[05:22:49] (PS1) Marostegui: pc2021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1306143 (https://phabricator.wikimedia.org/T430478)
[05:23:26] (CR) Marostegui: [C:+2] pc2021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1306143 (https://phabricator.wikimedia.org/T430478) (owner: Marostegui)
[05:24:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on pc2021.codfw.wmnet,pc1021.eqiad.wmnet with reason: Debugging
[05:33:52] RECOVERY - MariaDB Replica Lag: pc1 on pc2021 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[05:50:50] SRE-swift-storage, EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12063674 (lado85) Update. timeline is broken by small cyrillic letter х only if that cyrillic letter х is part of standard text. In links it works well. ~~~~
[06:04:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc2021: pc1 repool
[06:04:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:04:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:04:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2021: pc1 repool
[06:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state.
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[06:30:59] PROBLEM - Host cp6002 is DOWN: CRITICAL - Time to live exceeded (10.136.1.6)
[06:31:17] RECOVERY - Host cp6002 is UP: PING OK - Packet loss = 0%, RTA = 87.16 ms
[06:35:43] (PS1) Muehlenhoff: Update account meta data for dtotten-wmf [puppet] - https://gerrit.wikimedia.org/r/1306148
[06:37:45] (CR) Slyngshede: [C:+1] Update account meta data for dtotten-wmf [puppet] - https://gerrit.wikimedia.org/r/1306148 (owner: Muehlenhoff)
[06:40:41] (CR) Muehlenhoff: [C:+2] Update account meta data for dtotten-wmf [puppet] - https://gerrit.wikimedia.org/r/1306148 (owner: Muehlenhoff)
[06:43:37] (CR) Muehlenhoff: [C:+2] Add Ahmon Dancy to releng-related approvals [puppet] - https://gerrit.wikimedia.org/r/1305566 (owner: Muehlenhoff)
[06:59:50] (PS1) Muehlenhoff: Update account metadata for edtadros [puppet] - https://gerrit.wikimedia.org/r/1306149
[07:00:05] Amir1, urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:03] (CR) Slyngshede: [C:+2] C:dumps::web::xmldumps block generic user-agents [puppet] - https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: Slyngshede)
[07:02:24] !log bump space for prometheus k8s-dse in eqiad
[07:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:25] (PS1) Muehlenhoff: Update account medadata for migurski [puppet] - https://gerrit.wikimedia.org/r/1306150
[07:07:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:28] (CR) Slyngshede: [C:+1] Update account medadata for migurski [puppet] - https://gerrit.wikimedia.org/r/1306150 (owner: Muehlenhoff)
[07:09:49] (CR) Slyngshede: [C:+1] Update account metadata for edtadros [puppet] - https://gerrit.wikimedia.org/r/1306149 (owner: Muehlenhoff)
[07:10:39] (CR) Muehlenhoff: [C:+1] "LGTM" [software] - https://gerrit.wikimedia.org/r/1305837 (owner: Filippo Giunchedi)
[07:14:53] SRE, SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12063790 (fgiunchedi)
[07:17:35] SRE, SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12063799 (fgiunchedi) Hello @Monrac5, thank you for reaching out -- just to confirm: you are not part of WMDE staff, correct ?
[07:19:18] SRE, SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12063806 (fgiunchedi) @KFrancis I could not find an NDA on file for Mona Thierse, would you mind arranging one? thank you so much!
[07:20:38] (Abandoned) Elukey: role::docker_registry: re-enable the blob cache [puppet] - https://gerrit.wikimedia.org/r/1304060 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[07:21:21] (Abandoned) Elukey: sre.hosts.reimage: introduce wmfroot [cookbooks] - https://gerrit.wikimedia.org/r/1302160 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[07:22:11] (Abandoned) Elukey: Turn paging on for kartotherian [puppet] - https://gerrit.wikimedia.org/r/1203835 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[07:22:13] (CR) Filippo Giunchedi: [C:+2] clinic-duty: add Telxius multiple dates support [software] - https://gerrit.wikimedia.org/r/1305837 (owner: Filippo Giunchedi)
[07:22:52] SRE, Traffic, Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12063820 (SLyngshede-WMF) In progress→Resolved User-Agent check as been deployed.
[07:23:08] !log jmm@cumin2003 START - Cookbook sre.puppet.disable-merges
[07:23:10] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.disable-merges (exit_code=0)
[07:23:19] (CR) Muehlenhoff: [C:+2] mirrors: Remove rsync [puppet] - https://gerrit.wikimedia.org/r/1304801 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[07:23:31] (PS2) Volans: config: type config_file as PathLike[str] [software/pywmflib] - https://gerrit.wikimedia.org/r/1298541
[07:23:53] !log jmm@cumin2003 START - Cookbook sre.puppet.disable-merges
[07:23:55] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.disable-merges (exit_code=0)
[07:24:11] (Abandoned) Elukey: config: type config_file as PathLike[str] [software/pywmflib] - https://gerrit.wikimedia.org/r/1298541 (owner: Volans)
[07:24:19] (PS2) Volans: decorators: fix dynamic callbacks bug in retry [software/pywmflib] - https://gerrit.wikimedia.org/r/1298656
[07:27:54] (CR) Jelto: [C:+1] "lgtm, thank you!
I added one comment in-line" [puppet] - https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: Andrew Bogott)
[07:28:43] (CR) Muehlenhoff: [C:+2] Update account metadata for edtadros [puppet] - https://gerrit.wikimedia.org/r/1306149 (owner: Muehlenhoff)
[07:30:32] SRE, Product Safety and Integrity, iPoid-Service (IPoid OpenSearch): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484 (mszwarc) NEW
[07:33:05] (CR) Elukey: [C:+2] decorators: fix dynamic callbacks bug in retry [software/pywmflib] - https://gerrit.wikimedia.org/r/1298656 (owner: Volans)
[07:33:10] (CR) Muehlenhoff: [C:+2] Update account medadata for migurski [puppet] - https://gerrit.wikimedia.org/r/1306150 (owner: Muehlenhoff)
[07:33:51] SRE, Data-Platform-SRE, Product Safety and Integrity, iPoid-Service (IPoid OpenSearch): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12063882 (kostajh)
[07:34:10] (CR) Elukey: [C:+2] config: raise on missing INI file when raises=True [software/pywmflib] - https://gerrit.wikimedia.org/r/1298657 (owner: Volans)
[07:34:22] (PS2) Volans: config: raise on missing INI file when raises=True [software/pywmflib] - https://gerrit.wikimedia.org/r/1298657
[07:34:34] (PS2) Volans: __init__: fail clearly when unknown __version__ [software/pywmflib] - https://gerrit.wikimedia.org/r/1298658
[07:34:40] (PS2) Volans: phabricator: reject trailing newline in task ID [software/pywmflib] - https://gerrit.wikimedia.org/r/1298659
[07:34:45] (PS2) Volans: dns: resolve() instead of deprecated query() [software/pywmflib] - https://gerrit.wikimedia.org/r/1298660
[07:34:52] (PS2) Volans: actions: fix ActionsDict docstring example output [software/pywmflib] - https://gerrit.wikimedia.org/r/1298661
[07:34:58] (PS2) Volans: interactive: fix ask_input Returns docstring
[software/pywmflib] - https://gerrit.wikimedia.org/r/1298662
[07:35:03] (PS2) Volans: interactive: improve error message with validators [software/pywmflib] - https://gerrit.wikimedia.org/r/1298663
[07:35:08] (PS2) Volans: irc: set the handler level via setLevel() [software/pywmflib] - https://gerrit.wikimedia.org/r/1298664
[07:42:36] (PS1) Marostegui: Revert "pc2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1306156
[07:43:13] (CR) Marostegui: [C:+2] Revert "pc2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1306156 (owner: Marostegui)
[07:45:22] (PS1) Filippo Giunchedi: clinic-duty: fix Lumen notifications [software] - https://gerrit.wikimedia.org/r/1306157
[07:45:30] !log marostegui@cumin1003 conftool action : set/weight=100; selector: name=clouddb1026.eqiad.wmnet
[07:49:58] (CR) Muehlenhoff: [C:+1] "LGTM" [software] - https://gerrit.wikimedia.org/r/1306157 (owner: Filippo Giunchedi)
[07:51:35] (CR) Filippo Giunchedi: [C:+2] clinic-duty: fix Lumen notifications [software] - https://gerrit.wikimedia.org/r/1306157 (owner: Filippo Giunchedi)
[07:51:38] SRE, Data-Platform-SRE, iPoid-Service (IPoid OpenSearch), Product Safety and Integrity (Sprint 2026 (Jun 29 - Jul 17)): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12063954 (OKryva-WMF)
[07:52:27] (CR) JMeybohm: [C:+1] "small nit, other then that: LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1305377 (https://phabricator.wikimedia.org/T427405) (owner: Blake)
[07:53:16] (PS1) Muehlenhoff: proton: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1306158
[07:53:38] (PS1) Filippo Giunchedi: site: put cloudvirt10[78-80] in service [puppet] - https://gerrit.wikimedia.org/r/1306159 (https://phabricator.wikimedia.org/T429563)
[07:55:26] (CR) Muehlenhoff: [C:+2] proton: Bump image
[deployment-charts] - https://gerrit.wikimedia.org/r/1306158 (owner: Muehlenhoff)
[08:00:46] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply
[08:01:44] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply
[08:04:14] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[08:05:29] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:08] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply
[08:07:39] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[08:09:34] !log installing lcms2 security updates
[08:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:31] o/ I'd like to deploy some private code if there's some open time
[08:14:19] (PS1) Marostegui: installserver: Move clouddb102[6-7] to UEFI entry [puppet] - https://gerrit.wikimedia.org/r/1306160 (https://phabricator.wikimedia.org/T411570)
[08:14:39] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1306159 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[08:15:53] didn't hear anything so I'm starting
[08:19:52] (CR) Gkyziridis: [C:+2] ml-services: Switch to float16 and reduce context length for Qwen3.6-27B deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1305919 (https://phabricator.wikimedia.org/T425680) (owner: Gkyziridis)
[08:19:55] (CR) MSantos: [C:+1] Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: C.
Scott Ananian)
[08:20:32] (CR) Filippo Giunchedi: [C:+2] site: put cloudvirt10[78-80] in service [puppet] - https://gerrit.wikimedia.org/r/1306159 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[08:22:19] (Merged) jenkins-bot: ml-services: Switch to float16 and reduce context length for Qwen3.6-27B deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1305919 (https://phabricator.wikimedia.org/T425680) (owner: Gkyziridis)
[08:25:32] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:26:29] (PS2) Marostegui: installserver: Move clouddb102[6-7] to UEFI entry [puppet] - https://gerrit.wikimedia.org/r/1306160 (https://phabricator.wikimedia.org/T411570)
[08:27:14] RECOVERY - Host dbproxy1028 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[08:27:31] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1306160 (https://phabricator.wikimedia.org/T411570) (owner: Marostegui)
[08:29:39] SRE, Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12064129 (MoritzMuehlenhoff)
[08:29:57] (PS1) Hashar: proxy: Allow outbount HTTPS connections to port 25000 [puppet] - https://gerrit.wikimedia.org/r/1306161 (https://phabricator.wikimedia.org/T430479)
[08:32:17] !log installing libxslt bugfix updates
[08:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:31] (CR) Marostegui: [C:+2] installserver: Move clouddb102[6-7] to UEFI entry [puppet] - https://gerrit.wikimedia.org/r/1306160 (https://phabricator.wikimedia.org/T411570) (owner: Marostegui)
[08:33:21] (CR) Ozge: [C:+2] ml-services: Deploy latest version of revertrisk-wikidata.
[deployment-charts] - https://gerrit.wikimedia.org/r/1305889 (https://phabricator.wikimedia.org/T429675) (owner: Gkyziridis)
[08:33:48] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudvirt1078.eqiad.wmnet
[08:34:05] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudvirt1079.eqiad.wmnet
[08:34:13] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudvirt1080.eqiad.wmnet
[08:34:24] (CR) Ozge: [V:+2 C:+2] ml-services: Deploy latest version of revertrisk-wikidata. [deployment-charts] - https://gerrit.wikimedia.org/r/1305889 (https://phabricator.wikimedia.org/T429675) (owner: Gkyziridis)
[08:34:50] (PS2) Hashar: proxy: Allow outbount HTTPS connections to port 25000 [puppet] - https://gerrit.wikimedia.org/r/1306161 (https://phabricator.wikimedia.org/T430479)
[08:35:32] (Merged) jenkins-bot: ml-services: Deploy latest version of revertrisk-wikidata. [deployment-charts] - https://gerrit.wikimedia.org/r/1305889 (https://phabricator.wikimedia.org/T429675) (owner: Gkyziridis)
[08:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:37:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:40:30] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:43:57] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1078.eqiad.wmnet
[08:44:34] (PS1) Jelto: Update calico-crds to calico v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306165 (https://phabricator.wikimedia.org/T427400)
[08:46:40] (CR) CI reject: [V:-1] Update calico-crds to calico v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306165 (https://phabricator.wikimedia.org/T427400) (owner: Jelto)
[08:47:17] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1079.eqiad.wmnet
[08:47:40] SRE, Data-Platform-SRE (2026-06-05 - 2026-06-26), iPoid-Service (IPoid OpenSearch), Product Safety and Integrity (Sprint 2026 (Jun 29 - Jul 17)): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12064282 (BTullis) a:BTullis
[08:48:10] SRE, Data-Platform-SRE (2026-06-05 - 2026-06-26), iPoid-Service (IPoid OpenSearch), Product Safety and Integrity (Sprint 2026 (Jun 29 - Jul 17)): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12064288 (BTullis) p:Triage→High I'm looking in...
[08:49:37] (PS1) Arnaudb: backup: exclude gerrit caches [puppet] - https://gerrit.wikimedia.org/r/1306164 (https://phabricator.wikimedia.org/T411583)
[08:50:00] (PS1) Arnaudb: backups: exclude lucene index from Gerrit backups [puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T411583)
[08:50:13] (PS2) Arnaudb: backup: exclude lucene index from Gerrit backups [puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T411583)
[08:51:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:51:36] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893
[08:51:42] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893
[08:51:43] !log Deployed patch for T427287
[08:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:53] SRE, Infrastructure-Foundations, netops: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12064314 (ayounsi) Resolved→Open Alerts are back for eqiad C/D (+spines) - https://alerts.wikimedia.org/?q=scope%3Dnetwork&q=alertna...
[08:51:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1249: Upgrading db1249.eqiad.wmnet
[08:52:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1249: Upgrading db1249.eqiad.wmnet
[08:54:46] done
[08:54:51] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1249.eqiad.wmnet with OS trixie
[08:54:58] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:54:58] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893
[08:55:20] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2247: Upgrading db2247.codfw.wmnet
[08:55:42] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2247: Upgrading db2247.codfw.wmnet
[08:58:49] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2003.codfw.wmnet
[08:59:12] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-test-wdqs2001.codfw.wmnet
[08:59:30] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1080.eqiad.wmnet
[08:59:42] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-test-wdqs2001.codfw.wmnet
[08:59:52] cwilliams@cumin1003 major-upgrade (PID 3992688) is awaiting input
[09:00:13] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs-test2001.codfw.wmnet
[09:04:33] (CR) Elukey: "recheck" [software/pywmflib] - https://gerrit.wikimedia.org/r/1298657 (owner: Volans)
[09:09:26] !log ihurbain@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[09:09:54] (Merged) jenkins-bot: config: raise on missing INI file when raises=True [software/pywmflib] - https://gerrit.wikimedia.org/r/1298657 (owner:
Volans)
[09:10:02] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2247.codfw.wmnet with OS trixie
[09:12:15] (PS2) Jelto: Update calico-crds to calico v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306165 (https://phabricator.wikimedia.org/T427400)
[09:12:23] !log ihurbain@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[09:12:24] !log ihurbain@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[09:12:44] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1249.eqiad.wmnet with reason: host reimage
[09:13:03] !log ihurbain@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[09:14:22] (CR) CI reject: [V:-1] Update calico-crds to calico v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306165 (https://phabricator.wikimedia.org/T427400) (owner: Jelto)
[09:14:42] SRE, Data-Platform-SRE (2026-06-05 - 2026-06-26), iPoid-Service (IPoid OpenSearch), Product Safety and Integrity (Sprint 2026 (Jun 29 - Jul 17)): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12064391 (atsuko) We got ProbeDown alert on 27 Jun 2026...
[09:15:57] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: Tiziano Fogli)
[09:16:13] (PS3) Jelto: Update calico-crds to calico v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306165 (https://phabricator.wikimedia.org/T427400)
[09:18:03] (CR) CI reject: [V:-1] Update calico-crds to calico v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306165 (https://phabricator.wikimedia.org/T427400) (owner: Jelto)
[09:18:12] (PS1) Gkyziridis: ml-services: Remove qwen completely from experiental.
[deployment-charts] - https://gerrit.wikimedia.org/r/1306172 (https://phabricator.wikimedia.org/T425680)
[09:18:22] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1295021 (owner: Elukey)
[09:19:00] (PS4) Jelto: Update calico-crds to calico v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306165 (https://phabricator.wikimedia.org/T427400)
[09:19:13] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1249.eqiad.wmnet with reason: host reimage
[09:19:43] !log installing librabbitmq security updates
[09:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:21] (PS1) Marostegui: WIP: master.yaml [puppet] - https://gerrit.wikimedia.org/r/1306173 (https://phabricator.wikimedia.org/T430488)
[09:22:00] !log T418494 delete apiportalwiki cirrussearch indices
[09:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:05] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494
[09:22:08] (CR) Elukey: [C:+1] "LGTM, we can then do the kafka mirror cleanup!
Just to double check the commit msg - the fact that we'll declare the alerts on each kafka " [puppet] - https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: Tiziano Fogli)
[09:22:33] (CR) Elukey: [C:+2] __init__: fail clearly when unknown __version__ [software/pywmflib] - https://gerrit.wikimedia.org/r/1298658 (owner: Volans)
[09:23:08] (CR) Elukey: [C:+2] phabricator: reject trailing newline in task ID [software/pywmflib] - https://gerrit.wikimedia.org/r/1298659 (owner: Volans)
[09:23:32] (CR) Elukey: [C:+2] dns: resolve() instead of deprecated query() [software/pywmflib] - https://gerrit.wikimedia.org/r/1298660 (owner: Volans)
[09:23:46] (CR) Elukey: [C:+2] actions: fix ActionsDict docstring example output [software/pywmflib] - https://gerrit.wikimedia.org/r/1298661 (owner: Volans)
[09:24:00] (CR) Elukey: [C:+2] interactive: fix ask_input Returns docstring [software/pywmflib] - https://gerrit.wikimedia.org/r/1298662 (owner: Volans)
[09:24:44] (CR) Elukey: [C:+2] interactive: improve error message with validators [software/pywmflib] - https://gerrit.wikimedia.org/r/1298663 (owner: Volans)
[09:24:46] (PS2) Marostegui: WIP: master.yaml [puppet] - https://gerrit.wikimedia.org/r/1306173 (https://phabricator.wikimedia.org/T430488)
[09:25:01] (CR) Elukey: [C:+2] irc: set the handler level via setLevel() [software/pywmflib] - https://gerrit.wikimedia.org/r/1298664 (owner: Volans)
[09:25:09] (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1306173 (https://phabricator.wikimedia.org/T430488) (owner: Marostegui)
[09:25:41] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2247.codfw.wmnet with reason: host reimage
[09:29:29] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2247.codfw.wmnet with reason: host reimage
[09:32:26] ops-eqiad, SRE,
06collaboration-services, 06DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#12064431 (10LSobanski) p:05Triage→03Medium [09:35:54] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1249.eqiad.wmnet with OS trixie [09:36:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:36:36] !log drop database apiportalwiki T418494 [09:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:41] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [09:37:02] RESOLVED: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [09:37:19] (03CR) 10Elukey: "Left some comments but it is basically ready to go, feel free to merge after reviewing/following-up on those!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) (owner: 10JHathaway) [09:37:54] (03PS3) 10JHathaway: durable: fix test when run in a tmux [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304619 [09:38:18] (03CR) 10Elukey: [C:03+1] durable: fix test when run in a tmux [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304619 (owner: 10JHathaway) [09:39:01] (03CR) 10Elukey: [C:03+1] Link to cookbook doc [cookbooks] - 10https://gerrit.wikimedia.org/r/1301339 (owner: 10Federico Ceratto) [09:39:42] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=codfw [09:40:02] (03CR) 10Elukey: [C:03+1] weak etag comments (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [09:41:50] (03PS3) 10Gmodena: WIP: airflow-wikidata: add qlever index PVCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305854 (https://phabricator.wikimedia.org/T428235) [09:42:58] (03PS2) 10Gmodena: WIP: admin_ng: add wdqs local-storage resources for qlever indexer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305852 (https://phabricator.wikimedia.org/T428235) [09:42:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:43:04] (03CR) 10CI reject: [V:04-1] WIP: airflow-wikidata: add qlever index PVCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305854 (https://phabricator.wikimedia.org/T428235) (owner: 10Gmodena) [09:43:44] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:43:58] (03CR) 10Clément Goubert: extension-list: Remove WikimediaApiPortalOAuth ext and WikimediaApiPortal skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305804 (https://phabricator.wikimedia.org/T429373) (owner: 10Krinkle) [09:44:07] (03CR) 
10Clément Goubert: [C:03+1] Remove remaining occurences of apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305999 (https://phabricator.wikimedia.org/T418494) (owner: 10Zabe) [09:46:10] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504 (10atsuko) 03NEW [09:46:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2247.codfw.wmnet with OS trixie [09:47:07] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504#12064544 (10atsuko) [09:47:09] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 10iPoid-Service (IPoid OpenSearch), 06Product Safety and Integrity (Sprint 2026 (Jun 29 - Jul 17)): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12064543 (10atsuko) [09:47:16] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504#12064546 (10atsuko) [09:49:16] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 10iPoid-Service (IPoid OpenSearch), 06Product Safety and Integrity (Sprint 2026 (Jun 29 - Jul 17)): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12064550 (10atsuko) 05Open→03Resolved Incident sh... [09:49:24] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1249: Migration of db1249.eqiad.wmnet completed [09:49:32] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 10iPoid-Service (IPoid OpenSearch), 06Product Safety and Integrity (Sprint 2026 (Jun 29 - Jul 17)): "IPoid request failed for IP" logs since 2026-06-25 - https://phabricator.wikimedia.org/T430484#12064554 (10BTullis) We discovered that the istio-ing... 
[09:51:39] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [09:54:23] !log installing libssh2 security updates [09:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2003 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1000) [10:01:27] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504#12064604 (10atsuko) [10:01:49] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12064605 (10Monrac5) >>! In T430304#12063799, @fgiunchedi wrote: > Hello @Monrac5, thank you for reaching out -- just to confirm: you are not part of WMDE staff, correct ? He... [10:02:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2247: Migration of db2247.codfw.wmnet completed [10:05:43] (03CR) 10Elukey: "All use cases are sound, I left a comment about only one to be sure about what you are doing. Looks good modulo the CI failures and the "c" [puppet] - 10https://gerrit.wikimedia.org/r/1305983 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [10:07:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. 
- https://phabricator.wikimedia.org/T426343#12064630 (10cmooney) p:05Medium→03High [10:08:51] (03PS1) 10Muehlenhoff: Apply builder role to build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1306204 (https://phabricator.wikimedia.org/T417389) [10:09:02] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504#12064645 (10atsuko) [10:09:53] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504#12064660 (10atsuko) [10:11:39] RESOLVED: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:14:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306204 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [10:14:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12064678 (10fgiunchedi) Would we have capacity (power, space) to move two hosts to their final allocation in E4/F4 ? [10:17:15] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[10:18:24] (03CR) 10Muehlenhoff: [C:03+2] Apply builder role to build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1306204 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [10:19:42] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:20:25] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12064707 (10MoritzMuehlenhoff) [10:23:22] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:24:59] (03PS1) 10Chlod Alejandro: Revert "nlwiki: change to Wikipedia 25 logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306210 [10:25:06] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504#12064746 (10atsuko) [10:25:37] (03Abandoned) 10Marostegui: WIP: master.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1306173 (https://phabricator.wikimedia.org/T430488) (owner: 10Marostegui) [10:25:43] (03PS2) 10Chlod Alejandro: Revert "nlwiki: change to Wikipedia 25 logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306210 (https://phabricator.wikimedia.org/T424519) [10:26:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306210 (https://phabricator.wikimedia.org/T424519) (owner: 10Chlod Alejandro) [10:28:45] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[10:29:47] 06SRE, 06collaboration-services: cumin2003 fails to connect to contint[12]003 - https://phabricator.wikimedia.org/T430510 (10MoritzMuehlenhoff) 03NEW [10:30:16] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12064793 (10MoritzMuehlenhoff) [10:34:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1249: Migration of db1249.eqiad.wmnet completed [10:34:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:35:01] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [10:41:54] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1006.eqiad.wmnet with OS trixie [10:42:32] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Phabricator: Add logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904#12064836 (10MoritzMuehlenhoff) >>! In T286904#12041211, @Aklapper wrote: > Hi, I myself am not sure what to add apart from T286904#11091482. Please elaborate if any... [10:42:33] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Remove qwen completely from experiental. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306172 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:43:14] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove qwen completely from experiental. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306172 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:45:41] (03Merged) 10jenkins-bot: ml-services: Remove qwen completely from experiental. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306172 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:47:34] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:48:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2247: Migration of db2247.codfw.wmnet completed [10:48:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:48:28] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12064872 (10karapayneWMDE) @fgiunchedi - WMDE EM here! Mona has been an intern here and will be working on wmde analytics topics in a volunteer capacity after their internship... [10:48:40] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [10:51:50] (03PS1) 10AikoChou: ml-services: bump event-emitting isvc image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306218 (https://phabricator.wikimedia.org/T421237) [10:54:41] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1006.eqiad.wmnet with OS trixie [10:55:13] (03PS1) 10Muehlenhoff: Rebuild for trixie [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1306219 (https://phabricator.wikimedia.org/T417389) [10:55:29] (03PS1) 10Gkyziridis: ml-services: increase helmfile timeout and redeploy qwen36-27b in float16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306220 (https://phabricator.wikimedia.org/T425680) [10:55:58] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [10:57:34] (03PS2) 10AikoChou: ml-services: bump event-emitting isvc image tags in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306218 (https://phabricator.wikimedia.org/T421237) [11:00:59] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: Add LiftWingLLM rate limit policy for LLM endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:01:29] (03PS1) 10Mszwarc: Temporarily change 
plwiki tagline for 1.7M articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306221 (https://phabricator.wikimedia.org/T430512) [11:02:59] (03PS2) 10Mszwarc: Temporarily change plwiki tagline for 1.7M articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306221 (https://phabricator.wikimedia.org/T430512) [11:03:02] (03CR) 10CI reject: [V:04-1] Rebuild for trixie [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1306219 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [11:04:15] (03CR) 10Federico Ceratto: [C:03+2] Link to cookbook doc [cookbooks] - 10https://gerrit.wikimedia.org/r/1301339 (owner: 10Federico Ceratto) [11:05:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306221 (https://phabricator.wikimedia.org/T430512) (owner: 10Mszwarc) [11:07:02] (03PS2) 10Btullis: topolvm-crds: add the TopoLVM CRD for version 0.38.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305976 (https://phabricator.wikimedia.org/T429331) [11:07:03] (03PS2) 10Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305978 (https://phabricator.wikimedia.org/T429331) [11:07:03] (03PS1) 10Btullis: topolvm: scrape controller and node metrics via prometheus.io annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306222 (https://phabricator.wikimedia.org/T429331) [11:07:05] (03PS1) 10Btullis: admin_ng: define the topolvm CSI releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331) [11:07:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:13:06] (03PS2) 10Clément Goubert: tls_terminator: Ratelimit accounting and upstream [puppet] - 10https://gerrit.wikimedia.org/r/1305079 (https://phabricator.wikimedia.org/T414440) [11:13:45] !log cmooney@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [11:14:50] (03PS3) 10Clément Goubert: tls_terminator: Ratelimit accounting and upstream [puppet] - 10https://gerrit.wikimedia.org/r/1305079 (https://phabricator.wikimedia.org/T414440) [11:14:59] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: emit 401 if rate limit is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) (owner: 10Daniel Kinzler) [11:16:43] (03CR) 10JMeybohm: [C:03+1] tls_terminator: Ratelimit accounting and upstream [puppet] - 10https://gerrit.wikimedia.org/r/1305079 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [11:17:23] (03PS2) 10Btullis: topolvm: import the upstream chart version 15.7.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305973 (https://phabricator.wikimedia.org/T429331) [11:17:23] (03PS2) 10Btullis: topolvm: customise the imported chart for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305974 (https://phabricator.wikimedia.org/T429331) [11:17:23] (03PS2) 10Btullis: topolvm: tighten controller RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305975 (https://phabricator.wikimedia.org/T429331) [11:17:23] (03PS2) 10Btullis: topolvm: scrape controller/node metrics via prometheus.io annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306222 (https://phabricator.wikimedia.org/T429331) [11:17:24] (03PS3) 10Btullis: topolvm-crds: add the TopoLVM CRD for version 0.38.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305976 (https://phabricator.wikimedia.org/T429331) [11:17:26] (03PS2) 10Btullis: admin_ng: define the topolvm CSI 
releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331) [11:17:30] (03PS3) 10Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305978 (https://phabricator.wikimedia.org/T429331) [11:19:27] (03CR) 10Clément Goubert: [C:03+2] tls_terminator: Ratelimit accounting and upstream [puppet] - 10https://gerrit.wikimedia.org/r/1305079 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [11:19:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [11:26:48] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:27:22] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:27:37] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:28:37] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:28:59] (03CR) 10Kevin Bazira: [C:03+1] ml-services: increase helmfile timeout and redeploy qwen36-27b in float16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306220 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:29:16] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:30:12] (03CR) 10Gkyziridis: [C:03+2] ml-services: increase helmfile timeout and redeploy qwen36-27b in float16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306220 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:30:35] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:30:57] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:31:33] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:32:36] !log cgoubert@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. 
[11:33:34] !log cgoubert@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:34:06] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:36:15] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:36:40] (03Merged) 10jenkins-bot: ml-services: increase helmfile timeout and redeploy qwen36-27b in float16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306220 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:36:44] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:36:44] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [11:36:51] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [11:36:59] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:37:04] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1252: Upgrading db1252.eqiad.wmnet [11:37:35] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1252: Upgrading db1252.eqiad.wmnet [11:38:10] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:38:17] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1006.eqiad.wmnet with OS trixie [11:38:31] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:39:10] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:39:38] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:40:46] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:40:51] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1252.eqiad.wmnet with OS trixie [11:41:11] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:43:06] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [11:46:26] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for private1-d8test-eqiad vlan - cmooney@cumin1003" [11:46:30] (03PS1) 10Cathal Mooney: Add include statement for 2620:0:861:167::/64 PTR records [dns] - 10https://gerrit.wikimedia.org/r/1306229 [11:47:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for private1-d8test-eqiad vlan - cmooney@cumin1003" [11:47:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:47:24] (03CR) 10CI reject: [V:04-1] Add include statement for 2620:0:861:167::/64 PTR records [dns] - 10https://gerrit.wikimedia.org/r/1306229 (owner: 10Cathal Mooney) [11:50:14] (03PS2) 10Cathal Mooney: Add include statement for 2620:0:861:167::/64 PTR records [dns] - 10https://gerrit.wikimedia.org/r/1306229 [11:51:52] (03CR) 10Cathal Mooney: [C:03+2] Add include statement for 2620:0:861:167::/64 PTR records [dns] - 10https://gerrit.wikimedia.org/r/1306229 (owner: 10Cathal Mooney) [11:53:03] !log cmooney@dns3003 START - running authdns-update [11:54:39] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply [11:55:05] !log cmooney@dns3003 END - running authdns-update [11:57:02] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: apply [11:57:18] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [11:57:47] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1252.eqiad.wmnet with reason: host reimage [11:58:36] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [11:58:42] (03PS1) 10Krinkle: 
varnish: Add edge fixup for corrupt upload.wm.o urls from mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/1306230 [12:00:57] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:01:29] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:02:09] (03PS1) 10Muehlenhoff: package_builder: Pass a keyfile for the deb-src apt source [puppet] - 10https://gerrit.wikimedia.org/r/1306231 (https://phabricator.wikimedia.org/T417389) [12:04:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1252.eqiad.wmnet with reason: host reimage [12:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:39] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:08:09] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [12:08:45] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [12:09:58] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [12:10:22] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [12:10:28] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [12:10:48] FIRING: PuppetFailure: Puppet has failed on build2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:11:01] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [12:11:40] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [12:12:11] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [12:12:47] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [12:13:28] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [12:15:27] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2315:9290 - https://phabricator.wikimedia.org/T430220#12065083 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [12:16:22] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [12:17:01] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [12:17:09] (03CR) 10Ozge: [C:03+1] ml-services: bump event-emitting isvc image tags in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306218 
(https://phabricator.wikimedia.org/T421237) (owner: 10AikoChou) [12:17:33] (03CR) 10JMeybohm: [C:03+1] "LGTM, but please only merge after changelog and chart review" [debs/calico] (v3.30) - 10https://gerrit.wikimedia.org/r/1305139 (https://phabricator.wikimedia.org/T427400) (owner: 10Jelto) [12:18:33] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [12:19:06] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [12:20:48] RESOLVED: PuppetFailure: Puppet has failed on build2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:21:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1252.eqiad.wmnet with OS trixie [12:22:24] (03PS1) 10Muehlenhoff: build2004: Enable profile::docker::builder::docker_pkg [puppet] - 10https://gerrit.wikimedia.org/r/1306245 (https://phabricator.wikimedia.org/T417389) [12:23:51] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305730 (owner: 10PipelineBot) [12:25:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306245 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [12:26:14] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305730 (owner: 10PipelineBot) [12:26:58] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302798 (owner: 10PipelineBot) [12:27:06] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288138 (owner: 10PipelineBot) [12:27:10] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282779 
(owner: 10PipelineBot) [12:27:16] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277075 (owner: 10PipelineBot) [12:27:17] (03PS2) 10Krinkle: varnish: Add edge fixup for corrupt upload.wm.o urls from mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/1306230 [12:28:09] !log add cloudvirt10[78-80] to nova -- with compute disabled - T429563 [12:28:12] (03CR) 10Elukey: [C:03+1] package_builder: Pass a keyfile for the deb-src apt source [puppet] - 10https://gerrit.wikimedia.org/r/1306231 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [12:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:14] T429563: Put cloudvirt10[77-80] in service - https://phabricator.wikimedia.org/T429563 [12:29:25] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:29:48] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:32:25] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:32:55] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:33:04] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:33:35] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:34:11] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [12:34:23] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [12:34:38] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [12:35:26] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [12:36:26] (03CR) 10Bartosz Wójtowicz: [C:03+2] rest-gateway: Add LiftWingLLM rate limit policy for LLM endpoints [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [12:37:01] (03CR) 10Ozge: [C:03+2] ml-services: Bump revscoring staging images to 2026-06-23-094330-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305384 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [12:37:11] (03CR) 10Ozge: [V:03+2 C:03+2] ml-services: Bump revscoring staging images to 2026-06-23-094330-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305384 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [12:37:18] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1252: Migration of db1252.eqiad.wmnet completed [12:37:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:38:52] (03Merged) 10jenkins-bot: rest-gateway: Add LiftWingLLM rate limit policy for LLM endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [12:40:17] (03CR) 10AikoChou: [C:03+2] ml-services: bump event-emitting isvc image tags in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306218 (https://phabricator.wikimedia.org/T421237) (owner: 10AikoChou) [12:40:21] (03Merged) 10jenkins-bot: ml-services: Bump revscoring staging images to 2026-06-23-094330-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305384 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [12:41:12] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [12:41:16] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [12:42:02] PROBLEM - Host asw1-b13-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:42:02] PROBLEM - Host asw1-b12-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:42:24] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 32, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:42:48] (03Merged) 10jenkins-bot: ml-services: bump event-emitting isvc image tags in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306218 (https://phabricator.wikimedia.org/T421237) (owner: 10AikoChou) [12:42:56] PROBLEM - Host scs-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [12:42:56] PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [12:42:56] PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [12:44:26] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [12:44:30] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [12:44:42] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [12:44:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:fxp0 (Core: msw1-b12-drmrs:3 {#D0062a}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:44:54] PROBLEM - Host cr1-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:45:08] PROBLEM - Host cr2-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:46:07] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply 
[12:46:11] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [12:47:23] FIRING: [16x] CertAlmostExpired: gNMI TLS certificate for asw1-b12-drmrs.mgmt.drmrs.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:48:08] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1306246 (owner: 10L10n-bot) [12:49:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:50:07] (03PS3) 10Btullis: topolvm: customise the imported chart for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305974 (https://phabricator.wikimedia.org/T429331) [12:50:07] (03PS3) 10Btullis: topolvm: tighten controller RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305975 (https://phabricator.wikimedia.org/T429331) [12:50:07] (03PS3) 10Btullis: topolvm: scrape controller/node metrics via prometheus.io annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306222 (https://phabricator.wikimedia.org/T429331) [12:50:07] (03PS4) 10Btullis: topolvm-crds: add the TopoLVM CRD for version 0.38.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305976 (https://phabricator.wikimedia.org/T429331) [12:50:08] (03PS3) 10Btullis: admin_ng: define the topolvm CSI releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331) [12:50:09] (03PS4) 10Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305978 
(https://phabricator.wikimedia.org/T429331) [12:51:05] o/ sorry going to sneak in another private code deploy to stop some logspam [12:54:17] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306262 [12:55:06] 06SRE, 06Infrastructure-Foundations, 10netops: SR-Linux: applying analytics-in acl to irb sub-interface blocks ARP - https://phabricator.wikimedia.org/T429499#12065297 (10cmooney) FWIW this behaviour is not evident on SR-Linux v25.10.1 (tested on lswtest-d8-eqiad). It seems to be yet another 24.x SR-Linux b... [12:59:17] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12065322 (10MoritzMuehlenhoff) p:05Triage→03Low [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1300). nyaa~ [13:00:05] VadymTS1: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:33] o\ [13:00:38] I need a deployer [13:00:45] o/ Hi sorry I'm currently running a scap and will be finishing up soon [13:00:49] (03CR) 10Muehlenhoff: [C:03+2] package_builder: Pass a keyfile for the deb-src apt source [puppet] - 10https://gerrit.wikimedia.org/r/1306231 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:01:04] (03CR) 10Tiziano Fogli: "Yes, all the kafka nodes export the same Prometheus rules, which will be deduplicated during the import phase on the Prometheus hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:17] !log dpogorzelski@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: sync [13:04:30] !log dpogorzelski@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: sync [13:05:11] !log dpogorzelski@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: sync [13:05:24] !log dpogorzelski@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: sync [13:05:57] rolled my scap back since something went odd during testing [13:06:22] 06SRE, 06Content-Transform-Team, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: Wikipedia wikis have broken maps URLs in infobox: "Bad GeoJSON - unknown \"type\" property \"ExternalData\"" - https://phabricator.wikimedia.org/T424046#12065358 (10Jgiannelos) I debugged this example: > Article URL: https:... 
[13:06:26] RECOVERY - Host asw1-b12-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.62 ms [13:06:26] RECOVERY - Host asw1-b13-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.66 ms [13:06:28] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:06:28] RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 88.12 ms [13:06:28] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 88.46 ms [13:06:46] (03PS1) 10JavierMonton: stream: pageview-trending-relative-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) [13:07:23] FIRING: [16x] CertAlmostExpired: gNMI TLS certificate for asw1-b12-drmrs.mgmt.drmrs.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:07:29] VadymTS1: I can spiderpig for you if you can test your changes? 
[13:07:43] Yes I can test [13:08:46] RECOVERY - Host scs-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.92 ms [13:09:02] k both of them look like small config changes that can go at the same time so I'm going to do that [13:09:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) (owner: 10VadymTS1) [13:09:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305811 (https://phabricator.wikimedia.org/T430182) (owner: 10VadymTS1) [13:09:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:09:42] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:fxp0 (Core: msw1-b12-drmrs:3 {#D0062a}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:10:24] (03CR) 10Ottomata: [C:03+1] [eventgate-*] Bump to v1.31.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305900 (https://phabricator.wikimedia.org/T415590) (owner: 10TChin) [13:10:44] RECOVERY - Host cr1-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.60 ms [13:10:44] (03Merged) 10jenkins-bot: User groups changes for English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) (owner: 10VadymTS1) [13:10:48] (03Merged) 10jenkins-bot: hrwiki: Add to wgCiteResponsiveReferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305811 (https://phabricator.wikimedia.org/T430182) (owner: 10VadymTS1) [13:10:56] RECOVERY - Host cr2-drmrs.mgmt is UP: PING OK - Packet 
loss = 0%, RTA = 87.68 ms [13:11:19] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1306056|User groups changes for English Wikiversity (T430416)]], [[gerrit:1305811|hrwiki: Add to wgCiteResponsiveReferences (T430182)]] [13:11:26] T430416: User group changes for English Wikiversity - https://phabricator.wikimedia.org/T430416 [13:11:27] T430182: Convert reference lists over to `responsive` on hrwiki - https://phabricator.wikimedia.org/T430182 [13:13:15] !log stran@deploy1003 stran, vadymts1: Backport for [[gerrit:1306056|User groups changes for English Wikiversity (T430416)]], [[gerrit:1305811|hrwiki: Add to wgCiteResponsiveReferences (T430182)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:13:23] Testing [13:14:30] (03CR) 10Tiziano Fogli: "If I understood correctly, credentials could be passed through a config file without the need to have them in the command line." [puppet] - 10https://gerrit.wikimedia.org/r/1305718 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite) [13:14:42] RESOLVED: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [13:14:54] (03PS1) 10Muehlenhoff: package_builder: Also specify apt key for three other source sources [puppet] - 10https://gerrit.wikimedia.org/r/1306271 (https://phabricator.wikimedia.org/T417389) [13:15:32] (03CR) 10CI reject: [V:04-1] package_builder: Also specify apt key for three other source sources [puppet] - 10https://gerrit.wikimedia.org/r/1306271 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:15:48] Tran: All good [13:15:53] continuing [13:15:55] !log stran@deploy1003 stran, vadymts1: Continuing with deployment [13:17:03] (03PS2) 10Muehlenhoff: package_builder: Also specify apt key for three other source sources [puppet] - 
10https://gerrit.wikimedia.org/r/1306271 (https://phabricator.wikimedia.org/T417389) [13:18:26] (03PS1) 10Bartosz Wójtowicz: ml-services: Move qwen3-14b to llm namespace. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306272 (https://phabricator.wikimedia.org/T426749) [13:19:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12065439 (10Jclark-ctr) 05Open→03Resolved [13:19:43] (03CR) 10Andrew Bogott: cloud-vps backups: exclude CI runner nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [13:20:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:20:30] (03CR) 10JHathaway: [C:03+2] durable: fix test when run in a tmux [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304619 (owner: 10JHathaway) [13:20:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frmx1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429529#12065453 (10Jclark-ctr) 05Open→03Resolved [13:21:10] (03CR) 10Joal: [C:03+1] "I have not checked the details, but the structure looks good to me here! 
Thanks @snwachukwu@wikimedia.org :)" [puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) (owner: 10Snwachukwu) [13:21:16] (03PS7) 10Andrew Bogott: cloud-vps backups: exclude CI runner nodes [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) [13:21:16] (03PS3) 10Andrew Bogott: cloud-vps backups: Resume backups for all deployment-prep hosts [puppet] - 10https://gerrit.wikimedia.org/r/1305968 (https://phabricator.wikimedia.org/T430018) [13:21:16] (03PS3) 10Andrew Bogott: cloud-vps backups: exclude puppet-diff worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1305969 (https://phabricator.wikimedia.org/T430018) [13:21:16] (03PS3) 10Andrew Bogott: cloud-vps backups: exclude xtools worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1305970 (https://phabricator.wikimedia.org/T430018) [13:22:32] oh my rollback is causing problems on canary. Hm... sorry, give me a minute VadymTS1. [13:22:41] okay [13:22:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1252: Migration of db1252.eqiad.wmnet completed [13:22:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:22:58] (03CR) 10AOkoth: "The host is already gone unfortunately... Noted for next decom." 
[puppet] - 10https://gerrit.wikimedia.org/r/1305661 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [13:23:08] (03PS2) 10AOkoth: site: remove phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1305661 (https://phabricator.wikimedia.org/T423727) [13:25:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:25:56] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps backups: exclude CI runner nodes [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [13:26:13] (03PS4) 10Andrew Bogott: cloud-vps backups: Resume backups for all deployment-prep hosts [puppet] - 10https://gerrit.wikimedia.org/r/1305968 (https://phabricator.wikimedia.org/T430018) [13:26:29] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps backups: Resume backups for all deployment-prep hosts [puppet] - 10https://gerrit.wikimedia.org/r/1305968 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [13:26:40] (03CR) 10FNegri: "I'm not sure why the CI job "gate-and-submit" did not actually merge this patch." [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [13:26:46] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps backups: exclude puppet-diff worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1305969 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [13:27:32] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:28:45] I know what's wrong, not sure if I need to run another scap on my private code to apply the fix but going to try to re-test on canary first and see if it resolves as a side effect. Otherwise will have to deploy private code changes first before these configs. 
[13:30:15] Tran I need to go; will you be able to finish the scap-sync without me? [13:30:25] Yes, sorry I'll handle deploying it [13:30:28] (03CR) 10Snwachukwu: "Thanks @btullis@wikimedia.org. I totally agree moving sqoop to Airflow is the best option." [puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) (owner: 10Snwachukwu) [13:30:39] https://wikitech.wikimedia.org/ on mwdebug is broken (as well as intermittent on prod). could it be linked to current release? https://logstash.wikimedia.org/goto/fece8768bde3968ef8697e2465735313 [13:30:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306271 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:31:26] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:31:34] exiting without rollback to deploy private code fixes. scap should sync config as a side effect. [13:31:43] (03CR) 10Snwachukwu: "Thanks @joal@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) (owner: 10Snwachukwu) [13:31:53] !log stran@deploy1003 Scap cancelled without rolling back. [13:33:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:33:54] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:34:31] !ack [13:34:32] 8104 (ACKED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [13:34:35] !incidents [13:34:35] 8104 (ACKED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [13:36:32] (03CR) 10Btullis: [C:03+2] Sqoop Mediawiki: Block monthly sqoop jobs on ingestion_wikis success flag. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) (owner: 10Snwachukwu) [13:38:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:39:00] <_joe_> !ack [13:39:01] All incidents are already acked. [13:39:38] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:41:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12065635 (10Papaul) [13:42:21] <_joe_> Tran: can you please rollback? [13:42:39] fix is in progress [13:42:42] (03PS3) 10JHathaway: redfish: add weak etag comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 [13:42:48] as in deploying right now [13:42:55] <_joe_> we're getting hammered with errors [13:42:56] <_joe_> ah ok [13:43:01] <_joe_> let's see if this works [13:43:01] (03PS3) 10Hashar: backup: exclude lucene index from Gerrit backups [puppet] - 10https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744) (owner: 10Arnaudb) [13:43:09] (03CR) 10JHathaway: [C:03+2] "Copied votes on follow-up patch sets have been updated:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [13:43:24] <_joe_> yes I see exceptions disappearing right now [13:43:51] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:44:51] (03PS1) 10JMeybohm: Copy wikikube istio config to config_1.29.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306279 
(https://phabricator.wikimedia.org/T427401) [13:44:53] (03PS1) 10JMeybohm: istio/main: Bump to istio 1.29.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306280 (https://phabricator.wikimedia.org/T427401) [13:45:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:45:44] _joe_: deploy of corrected code is done and I think charts show that it's resolved? [13:45:52] (03PS1) 10Kosta Harlan: hCaptcha: Align the loginattempt CAPTCHA with badlogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306282 (https://phabricator.wikimedia.org/T428892) [13:46:00] for context: [13:46:02] https://usercontent.irccloud-cdn.com/file/3GLsR5bc/image.png [13:46:31] these are 5xx - so yes it dropped at approx 13:39 [13:47:12] I had deployed some private code to the canary servers and didn't roll it back correctly, causing the problem. 
Sorry 🙇 [13:48:29] (03PS1) 10AOkoth: phabricator: add multi-replica support [puppet] - 10https://gerrit.wikimedia.org/r/1306283 (https://phabricator.wikimedia.org/T377889) [13:48:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12065694 (10Papaul) [13:49:10] (03CR) 10CI reject: [V:04-1] phabricator: add multi-replica support [puppet] - 10https://gerrit.wikimedia.org/r/1306283 (https://phabricator.wikimedia.org/T377889) (owner: 10AOkoth) [13:49:17] !log Deployed patch for T427287 [13:49:19] (03CR) 10Jelto: [C:03+1] Copy wikikube istio config to config_1.29.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306279 (https://phabricator.wikimedia.org/T427401) (owner: 10JMeybohm) [13:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:28] (03CR) 10AOkoth: [C:03+2] site: remove phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1305661 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [13:49:50] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:50:23] (03PS1) 10Muehlenhoff: os-reports/bullseye: task references [puppet] - 10https://gerrit.wikimedia.org/r/1306284 [13:51:09] (03PS6) 10CWilliams: Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) [13:51:14] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:51:19] (03PS1) 10Bartosz Wójtowicz: ml-services: Add qwen3-14b deployment to llm namespace. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1306286 (https://phabricator.wikimedia.org/T426749) [13:51:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12065734 (10Papaul) [13:51:43] (03Abandoned) 10Bartosz Wójtowicz: ml-services: Move qwen3-14b to llm namespace. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306272 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [13:52:34] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306280 (https://phabricator.wikimedia.org/T427401) (owner: 10JMeybohm) [13:52:42] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:52:43] (03PS2) 10AOkoth: phabricator: add multi-replica support [puppet] - 10https://gerrit.wikimedia.org/r/1306283 (https://phabricator.wikimedia.org/T377889) [13:53:19] (03CR) 10Elukey: [C:03+1] build2004: Enable profile::docker::builder::docker_pkg [puppet] - 10https://gerrit.wikimedia.org/r/1306245 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:53:57] (03CR) 10CI reject: [V:04-1] redfish: add weak etag comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [13:54:05] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:54:06] 06SRE, 06Infrastructure-Foundations, 10netops: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12065756 (10cmooney) >>! In T429242#12064314, @ayounsi wrote: > Alerts are back for eqiad C/D (+spines) - https://alerts.wikimedia.org/?q=scop... 
[13:54:27] (03CR) 10Muehlenhoff: [C:03+2] os-reports/bullseye: task references [puppet] - 10https://gerrit.wikimedia.org/r/1306284 (owner: 10Muehlenhoff) [13:54:55] (03CR) 10JHathaway: [C:03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [13:55:02] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:55:55] (03CR) 10Hnowlan: [C:03+2] restbase: add disk space alert [alerts] - 10https://gerrit.wikimedia.org/r/1304852 (https://phabricator.wikimedia.org/T407141) (owner: 10Hnowlan) [13:57:08] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:57:15] (03PS1) 10Filippo Giunchedi: hieradata: add hypervisor IDs for cloudvirt10[78-80] [puppet] - 10https://gerrit.wikimedia.org/r/1306289 (https://phabricator.wikimedia.org/T429563) [13:57:52] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Align the loginattempt CAPTCHA with badlogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306282 (https://phabricator.wikimedia.org/T428892) (owner: 10Kosta Harlan) [13:58:12] (03Merged) 10jenkins-bot: restbase: add disk space alert [alerts] - 10https://gerrit.wikimedia.org/r/1304852 (https://phabricator.wikimedia.org/T407141) (owner: 10Hnowlan) [13:58:16] (03CR) 10Gmodena: WIP: admin_ng: add wdqs local-storage resources for qlever indexer (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305852 (https://phabricator.wikimedia.org/T428235) (owner: 10Gmodena) [13:58:54] (03CR) 10Hashar: [C:04-1] "I have amended the commit message to link to T257744 (*Decide if Gerrit's indices should get backed up*) which was filed when QChris did " [puppet] - 10https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744) (owner: 10Arnaudb) [14:00:56] (03Merged) 10jenkins-bot: redfish: add weak etag comments [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [14:01:03] (03PS3) 10Krinkle: varnish: Add edge fixup for corrupt upload.wm.o urls from mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/1306230 [14:02:51] (03CR) 10Dreamy Jazz: User groups changes for English Wikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) (owner: 10VadymTS1) [14:03:24] (03CR) 10Thcipriani: [C:03+1] Add Ahmon Dancy to releng-related approvals [puppet] - 10https://gerrit.wikimedia.org/r/1305566 (owner: 10Muehlenhoff) [14:03:50] (03CR) 10Thcipriani: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1305568 (owner: 10Muehlenhoff) [14:04:27] (03CR) 10VadymTS1: User groups changes for English Wikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) (owner: 10VadymTS1) [14:04:45] (03PS1) 10JMeybohm: Readd 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1306292 (https://phabricator.wikimedia.org/T427401) [14:05:25] jouncebot: nowandnext [14:05:25] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [14:05:25] In 0 hour(s) and 24 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1430) [14:05:38] going to sync a config patch [14:05:54] (03CR) 10Jelto: [C:03+1] "lgtm 🎉" [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1306292 (https://phabricator.wikimedia.org/T427401) (owner: 10JMeybohm) [14:05:59] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:06:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306282 (https://phabricator.wikimedia.org/T428892) (owner: 10Kosta Harlan) [14:07:02] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:07:02] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [14:07:03] (03PS1) 10VadymTS1: [config] Fix code in core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306298 [14:07:10] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [14:07:22] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1260: Upgrading db1260.eqiad.wmnet [14:07:58] (03CR) 10Jforrester: [C:03+2] "Let's try this again. Not that it hugely matters." [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306136 (owner: 10TrainBranchBot) [14:08:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1260: Upgrading db1260.eqiad.wmnet [14:08:47] (03CR) 10TChin: [C:03+1] stream: pageview-trending-relative-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) (owner: 10JavierMonton) [14:09:09] (03PS3) 10AOkoth: phabricator: add multi-replica support [puppet] - 10https://gerrit.wikimedia.org/r/1306283 (https://phabricator.wikimedia.org/T377889) [14:09:49] (03CR) 10VadymTS1: "I create code fix patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306298 (owner: 10VadymTS1) [14:10:27] (03PS4) 10AOkoth: phabricator: add multi-replica support [puppet] - 10https://gerrit.wikimedia.org/r/1306283 (https://phabricator.wikimedia.org/T377889) [14:10:36] (03Merged) 10jenkins-bot: hCaptcha: Align the loginattempt CAPTCHA with badlogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306282 (https://phabricator.wikimedia.org/T428892) (owner: 10Kosta Harlan) [14:10:41] (03CR) 10Ottomata: [C:03+1] stream: 
pageview-trending-relative-next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) (owner: 10JavierMonton) [14:10:56] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1306282|hCaptcha: Align the loginattempt CAPTCHA with badlogin (T428892)]] [14:11:02] T428892: Cannot login: incorrectly claims wrong username and password - https://phabricator.wikimedia.org/T428892 [14:11:33] cwilliams@cumin1003 major-upgrade (PID 4033184) is awaiting input [14:11:41] (03PS1) 10Chlod Alejandro: frwiki: change to Wikipedia 25 logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306304 (https://phabricator.wikimedia.org/T430409) [14:11:43] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:12:50] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1306282|hCaptcha: Align the loginattempt CAPTCHA with badlogin (T428892)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:13:08] (03PS1) 10Mvolz: Update translators for zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306305 (https://phabricator.wikimedia.org/T428915) [14:13:30] (03CR) 10Jcrespo: "Some thoughts, but not weighing on this patch particularly." 
[puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744) (owner: Arnaudb)
[14:13:53] (CR) Dreamy Jazz: [config] Fix code in core-Permissions.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1306298 (owner: VadymTS1)
[14:14:04] !log kharlan@deploy1003 kharlan: Continuing with deployment
[14:14:19] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1306136 (owner: TrainBranchBot)
[14:14:21] (PS1) Jelto: Update calico to v3.30.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1306307 (https://phabricator.wikimedia.org/T427400)
[14:14:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306304 (https://phabricator.wikimedia.org/T430409) (owner: Chlod Alejandro)
[14:14:46] (PS2) VadymTS1: [config] Fix code in core-Permissions.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1306298
[14:15:00] (CR) VadymTS1: [config] Fix code in core-Permissions.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1306298 (owner: VadymTS1)
[14:15:09] (CR) Jcrespo: "To give additional context- maybe different parts of new gerrit can have different backup policies- e.g.
if indexes are just for search, t" [puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744) (owner: Arnaudb)
[14:15:51] cwilliams@cumin1003 major-upgrade (PID 4033184) is awaiting input
[14:17:32] (CR) JMeybohm: [V:+2 C:+2] Readd 1.15.7 [debs/istioctl] - https://gerrit.wikimedia.org/r/1306292 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[14:17:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306298 (owner: VadymTS1)
[14:17:48] (CR) Hnowlan: [C:+1] "Yep - I have rolled out the alertmanager check as of 13:58 so I will merge this in ~20 minutes or so." [puppet] - https://gerrit.wikimedia.org/r/1305083 (https://phabricator.wikimedia.org/T407141) (owner: Tiziano Fogli)
[14:18:21] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306282|hCaptcha: Align the loginattempt CAPTCHA with badlogin (T428892)]] (duration: 07m 25s)
[14:18:26] T428892: Cannot login: incorrectly claims wrong username and password - https://phabricator.wikimedia.org/T428892
[14:23:46] (PS1) Gerrit maintenance bot: mariadb: Promote db1210 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1306317 (https://phabricator.wikimedia.org/T430540)
[14:23:53] (PS1) Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1306318 (https://phabricator.wikimedia.org/T430540)
[14:24:57] (PS1) JMeybohm: Rakefile: Support multiple istio versions [deployment-charts] - https://gerrit.wikimedia.org/r/1306319 (https://phabricator.wikimedia.org/T427401)
[14:25:24] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[14:25:27] (PS2) JMeybohm: Rakefile: Support multiple istio versions [deployment-charts] - https://gerrit.wikimedia.org/r/1306319 (https://phabricator.wikimedia.org/T427401)
[14:25:27] (PS2) JMeybohm: Copy wikikube istio config to config_1.29.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1306279 (https://phabricator.wikimedia.org/T427401)
[14:25:27] (PS2) JMeybohm: istio/main: Bump to istio 1.29.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/1306280 (https://phabricator.wikimedia.org/T427401)
[14:26:31] jouncebot: nowandnext
[14:26:31] No deployments scheduled for the next 0 hour(s) and 3 minute(s)
[14:26:31] In 0 hour(s) and 3 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1430)
[14:27:12] (PS2) Chlod Alejandro: frwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1306304 (https://phabricator.wikimedia.org/T430409)
[14:28:43] (CR) CI reject: [V:-1] frwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1306304 (https://phabricator.wikimedia.org/T430409) (owner: Chlod Alejandro)
[14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1430)
[14:30:24] (PS4) Btullis: topolvm: customise the imported chart for WMF [deployment-charts] - https://gerrit.wikimedia.org/r/1305974 (https://phabricator.wikimedia.org/T429331)
[14:30:24] (PS4) Btullis: topolvm: tighten controller RBAC [deployment-charts] - https://gerrit.wikimedia.org/r/1305975 (https://phabricator.wikimedia.org/T429331)
[14:30:24] (PS4) Btullis: topolvm: scrape controller/node metrics via prometheus.io annotations [deployment-charts] - https://gerrit.wikimedia.org/r/1306222 (https://phabricator.wikimedia.org/T429331)
[14:30:24] (PS5) Btullis: topolvm-crds: add the TopoLVM CRD for version 0.38.1
[deployment-charts] - https://gerrit.wikimedia.org/r/1305976 (https://phabricator.wikimedia.org/T429331)
[14:30:25] (PS4) Btullis: admin_ng: define the topolvm CSI releases [deployment-charts] - https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331)
[14:30:28] (PS5) Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1305978 (https://phabricator.wikimedia.org/T429331)
[14:31:40] (PS3) Chlod Alejandro: frwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1306304 (https://phabricator.wikimedia.org/T430409)
[14:37:54] SRE, Infrastructure-Foundations, ServiceOps new, Sustainability (Incident Followup): Setup url-downloader-next.w.o to simply tests - https://phabricator.wikimedia.org/T430166#12066158 (MoritzMuehlenhoff) p:Triage→Medium
[14:39:14] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306298 (owner: VadymTS1)
[14:39:30] (CR) Dreamy Jazz: "Thanks, syncing this now" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306298 (owner: VadymTS1)
[14:40:38] (Merged) jenkins-bot: [config] Fix code in core-Permissions.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1306298 (owner: VadymTS1)
[14:40:56] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1306298|[config] Fix code in core-Permissions.php]]
[14:41:27] (PS2) Clément Goubert: redioscope: Add survey for media ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/1295456 (https://phabricator.wikimedia.org/T424051)
[14:41:42] (CR) Andrew Bogott: [C:+1] "I think I gave you the wrong advice about how to do it but it looks right, same ID type as the other cloudvirts."
[puppet] - https://gerrit.wikimedia.org/r/1306289 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[14:42:47] !log dreamyjazz@deploy1003 vadymts1, dreamyjazz: Backport for [[gerrit:1306298|[config] Fix code in core-Permissions.php]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:42:58] (CR) Clément Goubert: redioscope: Add survey for media ratelimit (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1295456 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[14:43:21] (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1306319 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[14:43:42] !log dreamyjazz@deploy1003 vadymts1, dreamyjazz: Continuing with deployment
[14:43:57] (CR) Jcrespo: "If fast recovery is needed, maybe that should be achieved with redundancy, rather than backups- a replica delayed in time, but ready to be" [puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744) (owner: Arnaudb)
[14:44:18] (CR) Filippo Giunchedi: [C:+2] hieradata: add hypervisor IDs for cloudvirt10[78-80] [puppet] - https://gerrit.wikimedia.org/r/1306289 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[14:45:30] (PS1) JMeybohm: Clean up istio configs [deployment-charts] - https://gerrit.wikimedia.org/r/1306327 (https://phabricator.wikimedia.org/T427401)
[14:47:15] SRE, Infrastructure-Foundations, Puppet-Core, Release-Engineering-Team (Radar): [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#12066258 (LSobanski) Open→Resolved a:LSobanski Resolving. @Joe please reopen if this is still a...
[14:47:18] (PS7) CWilliams: Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230)
[14:48:00] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306298|[config] Fix code in core-Permissions.php]] (duration: 07m 03s)
[14:48:09] (PS5) AOkoth: phabricator: add multi-replica support [puppet] - https://gerrit.wikimedia.org/r/1306283 (https://phabricator.wikimedia.org/T377889)
[14:50:11] SRE, Maps, affects-Kiwix-and-openZIM, Content-Transform-Team (Work In Progress), Patch-For-Review: Wikipedia wikis have broken maps URLs in infobox: "Bad GeoJSON - unknown \"type\" property \"ExternalData\"" - https://phabricator.wikimedia.org/T424046#12066285 (Jgiannelos)
[14:51:27] (CR) AOkoth: "https://puppet-compiler.wmflabs.org/output/1306283/8798/" [puppet] - https://gerrit.wikimedia.org/r/1306283 (https://phabricator.wikimedia.org/T377889) (owner: AOkoth)
[14:52:02] (CR) CI reject: [V:-1] istio/main: Bump to istio 1.29.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/1306280 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[14:56:30] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[14:58:55] (PS2) JavierMonton: stream: pageview-trending-relative-next [deployment-charts] - https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134)
[14:59:08] (CR) JavierMonton: stream: pageview-trending-relative-next (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) (owner: JavierMonton)
[14:59:42] SRE, Data-Persistence-Backup, Infrastructure-Foundations, Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet -
https://phabricator.wikimedia.org/T341664#12066382 (LSobanski) @jcrespo Looking at your last comment, can this task be resolved?
[15:00:20] SRE, Infrastructure-Foundations: Re-IP hosts running Cassandra to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354871#12066385 (ayounsi) Left in codfw rows A-D are: aqs[2001-2012].codfw.wmnet and cassandra-dev[2001-2003].codfw.wmnet Ping @eevans ? eqiad is tracked in {T42...
[15:03:41] (CR) JMeybohm: [C:+2] Rakefile: Support multiple istio versions [deployment-charts] - https://gerrit.wikimedia.org/r/1306319 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[15:05:01] !log bking@cumin2003 conftool action : set/pooled=false; selector: dnsdisc=search,name=codfw
[15:05:05] (CR) CWilliams: "Nor me, it seems to be stuck. I pushed a trivial change to the testsand it seems to have woken it up, unlike a rebase." [cookbooks] - https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: CWilliams)
[15:05:15] (CR) CWilliams: [C:+2] Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: CWilliams)
[15:06:25] (PS3) JavierMonton: stream: pageview-trending-relative-next [deployment-charts] - https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134)
[15:06:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1306339
[15:06:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1306339 (owner: TrainBranchBot)
[15:09:28] (Merged) jenkins-bot: Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: CWilliams)
[15:09:30] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:09:39] (CR) Muehlenhoff: [C:+2] Bitu: Add Ahmon Dancy as second approver for Spiderpig access [puppet] - https://gerrit.wikimedia.org/r/1305568 (owner: Muehlenhoff)
[15:10:32] (PS4) Bking: WIP: cirrussearch: set hieradata for OpenSearch 1->2 migration [puppet] - https://gerrit.wikimedia.org/r/1304906 (https://phabricator.wikimedia.org/T429844)
[15:10:54] (PS5) Bking: cirrussearch: set hieradata for OpenSearch 1->2 migration [puppet] - https://gerrit.wikimedia.org/r/1304906 (https://phabricator.wikimedia.org/T429844)
[15:11:03] (CR) Atsuko: [C:+1] cirrussearch: set hieradata for OpenSearch 1->2 migration [puppet] - https://gerrit.wikimedia.org/r/1304906 (https://phabricator.wikimedia.org/T429844) (owner: Bking)
[15:11:14] (PS6) Bking: cirrussearch: set hieradata for OpenSearch 1->2 migration [puppet] - https://gerrit.wikimedia.org/r/1304906 (https://phabricator.wikimedia.org/T429844)
[15:11:19] (CR) Bking: [C:+2] cirrussearch: set hieradata for OpenSearch 1->2 migration [puppet] - https://gerrit.wikimedia.org/r/1304906 (https://phabricator.wikimedia.org/T429844) (owner: Bking)
[15:11:20] (CR) TChin: [C:+2] [eventgate-*] Bump to v1.31.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1305900 (https://phabricator.wikimedia.org/T415590) (owner: TChin)
[15:12:36] (CR) Andrew Bogott: [C:+1] wikimedia.org: add dumps-nfs [dns] - https://gerrit.wikimedia.org/r/1305406 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[15:12:44] (CR) JavierMonton: stream: pageview-trending-relative-next (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) (owner: JavierMonton)
[15:12:54] (PS1) Michael Große:
postEdit: temp account experiment instrumentation [core] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306341 (https://phabricator.wikimedia.org/T429110)
[15:13:00] (CR) Andrew Bogott: [C:+1] conftool-data: add dumps-nfs [puppet] - https://gerrit.wikimedia.org/r/1305402 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[15:13:15] (CR) Andrew Bogott: [C:+1] dumps: open nfs port to lb healthchecks [puppet] - https://gerrit.wikimedia.org/r/1305403 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[15:13:48] (CR) Andrew Bogott: [C:+1] dumps: add dumps-nfs service pool [puppet] - https://gerrit.wikimedia.org/r/1305405 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[15:14:03] (CR) CI reject: [V:-1] Clean up istio configs [deployment-charts] - https://gerrit.wikimedia.org/r/1306327 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[15:14:08] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[15:14:24] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[15:14:34] (PS1) Michael Große: maybeSendThankYouEdit: avoid sending notification to temp users [extensions/Echo] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306342 (https://phabricator.wikimedia.org/T429110)
[15:14:59] (CR) Andrew Bogott: [C:+1] "I have some concern that the clients will just lock up or crash when a fail-over happens, but these patches look reasonable and there's on" [puppet] - https://gerrit.wikimedia.org/r/1305404 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[15:15:19] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1306339 (owner: TrainBranchBot)
[15:15:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC
late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306341 (https://phabricator.wikimedia.org/T429110) (owner: Michael Große)
[15:15:49] (CR) Hnowlan: [C:+2] restbase: disable instance space icinga check [puppet] - https://gerrit.wikimedia.org/r/1305083 (https://phabricator.wikimedia.org/T407141) (owner: Tiziano Fogli)
[15:18:38] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2074.codfw.wmnet with OS trixie
[15:19:29] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1260.eqiad.wmnet with OS trixie
[15:20:16] SRE, SRE-tools, Infrastructure-Foundations: Deploy wmflib 3.0.0 to production - https://phabricator.wikimedia.org/T430552 (MoritzMuehlenhoff) NEW
[15:20:26] SRE, SRE-tools, Infrastructure-Foundations: Deploy wmflib 3.0.0 to production - https://phabricator.wikimedia.org/T430552#12066517 (MoritzMuehlenhoff) p:Triage→Medium
[15:20:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Echo] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306342 (https://phabricator.wikimedia.org/T429110) (owner: Michael Große)
[15:24:49] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[15:24:54] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[15:29:10] (Merged) jenkins-bot: Rakefile: Support multiple istio versions [deployment-charts] - https://gerrit.wikimedia.org/r/1306319 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[15:30:05] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1530). Please do the needful.
[15:30:16] (CR) Elukey: [C:+1] "If you want to also rename ./aux-k8s/config-1.24.yaml to 1.24.2 it will be more consistent, otherwise we can stick with config.yaml like m" [deployment-charts] - https://gerrit.wikimedia.org/r/1306327 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[15:30:48] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2087.codfw.wmnet with OS trixie
[15:30:48] (CR) Elukey: [C:+1] package_builder: Also specify apt key for three other source sources [puppet] - https://gerrit.wikimedia.org/r/1306271 (https://phabricator.wikimedia.org/T417389) (owner: Muehlenhoff)
[15:30:54] SRE, Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12066610 (MoritzMuehlenhoff)
[15:31:30] (CR) Arnaudb: [C:+2] "discussed in team meeting: lets exclude caches from backup" [puppet] - https://gerrit.wikimedia.org/r/1306164 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[15:32:44] !log installing glib2.0 security updates
[15:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:14] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1260.eqiad.wmnet with reason: host reimage
[15:34:41] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2074.codfw.wmnet with reason: host reimage
[15:38:32] SRE, Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12066712 (MoritzMuehlenhoff)
[15:38:53] !log installing zsh updates from Trixie point release
[15:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:56] (Merged) jenkins-bot: [eventgate-*] Bump to v1.31.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1305900 (https://phabricator.wikimedia.org/T415590) (owner: TChin)
[15:39:30] !log cwilliams@cumin1003 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 2:00:00 on db1260.eqiad.wmnet with reason: host reimage
[15:40:29] (PS5) Btullis: topolvm: customise the imported chart for WMF [deployment-charts] - https://gerrit.wikimedia.org/r/1305974 (https://phabricator.wikimedia.org/T429331)
[15:40:29] (PS5) Btullis: topolvm: tighten controller RBAC [deployment-charts] - https://gerrit.wikimedia.org/r/1305975 (https://phabricator.wikimedia.org/T429331)
[15:40:29] (PS5) Btullis: topolvm: scrape controller/node metrics via prometheus.io annotations [deployment-charts] - https://gerrit.wikimedia.org/r/1306222 (https://phabricator.wikimedia.org/T429331)
[15:40:30] (PS6) Btullis: topolvm-crds: add the TopoLVM CRD for version 0.38.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1305976 (https://phabricator.wikimedia.org/T429331)
[15:40:30] (PS5) Btullis: admin_ng: define the topolvm CSI releases [deployment-charts] - https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331)
[15:40:31] (PS6) Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1305978 (https://phabricator.wikimedia.org/T429331)
[15:43:00] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2074.codfw.wmnet with reason: host reimage
[15:46:27] (PS15) Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server change [puppet] - https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413)
[15:47:12] (CR) Muehlenhoff: sre.hosts.reboot-unattended: add new cookbook for unattended reboots (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1305707 (owner: Jelto)
[15:47:13] !log bking@cumin2003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2074.codfw.wmnet with OS trixie
[15:47:26] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] -
https://gerrit.wikimedia.org/r/1306348 (https://phabricator.wikimedia.org/T128546)
[15:47:59] SRE, Traffic, Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12066761 (HCoplin-WMF) Thank you!! Really appreciate it :)
[15:48:26] SRE, Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12066766 (MoritzMuehlenhoff)
[15:49:26] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2087.codfw.wmnet with reason: host reimage
[15:49:29] (CR) BCornwall: [C:+1] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1306318 (https://phabricator.wikimedia.org/T430540) (owner: Gerrit maintenance bot)
[15:50:05] !log installing libconfig-inifiles-perl security updates
[15:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:38] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:51:37] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:51:43] (PS1) Bking: cirrussearch: remove local logstash profile [puppet] - https://gerrit.wikimedia.org/r/1306349 (https://phabricator.wikimedia.org/T324335)
[15:53:29] (CR) Atsuko: [C:+1] cirrussearch: remove local logstash profile [puppet] - https://gerrit.wikimedia.org/r/1306349 (https://phabricator.wikimedia.org/T324335) (owner: Bking)
[15:53:42] (CR) Bking: [C:+2] cirrussearch: remove local logstash profile [puppet] - https://gerrit.wikimedia.org/r/1306349 (https://phabricator.wikimedia.org/T324335) (owner: Bking)
[15:54:04] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2087.codfw.wmnet with reason: host reimage
[15:55:01] hey folks, going to do a portal deploy
[15:56:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1260.eqiad.wmnet with OS trixie
[15:57:27] bking@cumin2003 reimage (PID 1449767) is awaiting input
[15:57:52] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1306347|Assets build - 2026-06-29 15:36:06+00:00]]
[15:58:22] (PS1) Hnowlan: docker_registry: migrate nrpe checks to alertmanager [puppet] - https://gerrit.wikimedia.org/r/1306351 (https://phabricator.wikimedia.org/T384321)
[15:58:46] !log jdrewniak@deploy1003 portalsbuilder, jdrewniak: Backport for [[gerrit:1306347|Assets build - 2026-06-29 15:36:06+00:00]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:00:13] (CR) Hnowlan: docker_registry: migrate nrpe checks to alertmanager (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1306351 (https://phabricator.wikimedia.org/T384321) (owner: Hnowlan)
[16:02:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: C. Scott Ananian)
[16:02:59] (CR) CDanis: "I think it's fair to not yet have a complete answer to this, but, do you expect this to be a permanent fork, or do you expect we'll want t" [deployment-charts] - https://gerrit.wikimedia.org/r/1305973 (https://phabricator.wikimedia.org/T429331) (owner: Btullis)
[16:04:57] !log jdrewniak@deploy1003 portalsbuilder, jdrewniak: Continuing with deployment
[16:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:43] SRE, Traffic, Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12066894 (BCornwall) There was an additional patch proposed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1305243 - Do we want this new messaging?
[16:06:30] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306347|Assets build - 2026-06-29 15:36:06+00:00]] (duration: 08m 37s)
[16:08:36] (CR) BCornwall: [C:+2] Dumps: user-agent enforcement messaging [puppet] - https://gerrit.wikimedia.org/r/1305243 (https://phabricator.wikimedia.org/T427836) (owner: Hcoplin)
[16:08:47] (PS1) Dzahn: CDN: turn off caching for zuul.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1306353 (https://phabricator.wikimedia.org/T430462)
[16:09:37] SRE-swift-storage, Commons: file missing after move - https://phabricator.wikimedia.org/T430561#12066912 (Aklapper)
[16:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:48] (CR) Ahmon Dancy: [V:+1] "Patchset 15 tested successfully" [puppet] - https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: Ahmon Dancy)
[16:13:08] (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1306280 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[16:13:11] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306348 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:13:58] (PS3) Gergő Tisza: Remove security-related log hooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1306141 (https://phabricator.wikimedia.org/T430564)
[16:14:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:16:38] (CR) CI reject: [V:-1] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1306348 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:18:37] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1306355
[16:19:56] (CR) Jdrewniak: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306348 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:21:33] (PS3) JMeybohm: Copy wikikube istio config to config_1.29.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1306279 (https://phabricator.wikimedia.org/T427401)
[16:21:33] (PS3) JMeybohm: istio/main: Bump to istio 1.29.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/1306280 (https://phabricator.wikimedia.org/T427401)
[16:21:33] (PS2) JMeybohm: Clean up istio configs [deployment-charts] - https://gerrit.wikimedia.org/r/1306327 (https://phabricator.wikimedia.org/T427401)
[16:21:40] (CR) JMeybohm: "I don't have a strong preference as well. But today I got confused from grepping for the 1.15 istio version in the repo (and the fact that" [deployment-charts] - https://gerrit.wikimedia.org/r/1306327 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[16:24:05] (CR) JMeybohm: [C:+2] istio/main: Bump to istio 1.29.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/1306280 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[16:24:10] (CR) JMeybohm: [C:+2] Copy wikikube istio config to config_1.29.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1306279 (https://phabricator.wikimedia.org/T427401) (owner: JMeybohm)
[16:24:39] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1260: Migration of db1260.eqiad.wmnet completed
[16:27:55] (CR) Jdrewniak: [C:-2] "blocking the TrainBranchBot merge, going to reschedule the deployment later."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1306348 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:32:01] SRE, Traffic, Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12067027 (HCoplin-WMF) oh -- yes please! @BTullis said he would merge it sometime today, but whoever gets to it first works for me.
[16:33:53] (CR) BCornwall: [V:+1 C:+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8801/console" [puppet] - https://gerrit.wikimedia.org/r/1305243 (https://phabricator.wikimedia.org/T427836) (owner: Hcoplin)
[16:36:28] (PS1) DLynch: Add missing resolveUrlOrTitle helper function [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306358 (https://phabricator.wikimedia.org/T430450)
[16:36:41] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306358 (https://phabricator.wikimedia.org/T430450) (owner: DLynch)
[16:41:28] (PS1) DLynch: Add missing visualeditor-suggestion-link message to extension.json [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306361 (https://phabricator.wikimedia.org/T430450)
[16:41:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306361 (https://phabricator.wikimedia.org/T430450) (owner: DLynch)
[16:42:53] SRE, Data-Persistence-Backup, Infrastructure-Foundations, Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664#12067089 (jcrespo)
@LSobanski I am not sure what was the scope of the original work, if it is covered by the patch, it is done, but I do... [16:43:49] 06SRE, 06Traffic, 13Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12067095 (10BCornwall) Not sure why pcc is no-op-ing on both clouddumps but I merged anyway since it's just an HTML change. [16:44:25] 06SRE, 10Data-Persistence-Backup, 06Infrastructure-Foundations, 07Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664#12067097 (10jcrespo) @LSobanski Maybe a better take is: I am 100% to resolve it from my side, and you can always open another for further... [16:47:50] (03CR) 10Ottomata: [C:03+1] stream: pageview-trending-relative-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) (owner: 10JavierMonton) [16:55:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [16:59:23] 06SRE, 10Data-Persistence-Backup, 06Infrastructure-Foundations, 07Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664#12067158 (10LSobanski) 05Open→03Resolved a:03LSobanski Thanks. [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1700) [17:00:04] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T1700). Please do the needful. 
[17:00:44] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:01:26] 06SRE, 06Traffic, 13Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12067176 (10HCoplin-WMF) Great! Thank you, again. I see the changes on the site now. Really appreciate the support on this :) [17:03:10] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2088.codfw.wmnet with OS trixie [17:05:38] !log bking@cumin2003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2087.codfw.wmnet with OS trixie [17:07:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:10:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1260: Migration of db1260.eqiad.wmnet completed [17:10:11] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [17:22:37] (03PS1) 10Cathal Mooney: Apply regular peering preference to primary IXP if AS-Path >= 3 hops [homer/public] - 10https://gerrit.wikimedia.org/r/1306369 [17:22:58] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2088.codfw.wmnet with reason: host reimage [17:28:47] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2088.codfw.wmnet with reason: host reimage [17:29:25] (03PS2) 10Cathal Mooney: Apply regular peering preference to primary IXP if AS-Path >= 3 hops [homer/public] - 
10https://gerrit.wikimedia.org/r/1306369 [17:30:15] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2062.codfw.wmnet with OS trixie [17:41:49] (03PS1) 10Cathal Mooney: Treat HE as transit rather than regular peer over NL-IX and AMS-IX [homer/public] - 10https://gerrit.wikimedia.org/r/1306374 [17:48:24] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2062.codfw.wmnet with reason: host reimage [17:51:59] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2088.codfw.wmnet with OS trixie [17:55:02] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:55:15] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2062.codfw.wmnet with reason: host reimage [17:55:53] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [17:56:16] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2069.codfw.wmnet with OS trixie [18:02:52] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [18:03:36] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [18:07:54] 06SRE, 06collaboration-services: cumin2003 fails to connect to contint[12]003 - https://phabricator.wikimedia.org/T430510#12067481 (10Dzahn) first just wanted to add: contint2002 can connect to both, just contint2003 can not. [18:13:04] 06SRE, 06collaboration-services: cumin2003 fails to connect to contint[12]003 - https://phabricator.wikimedia.org/T430510#12067485 (10Dzahn) What might be special here is that these were first nftables and then I was reminded we needed to switch back to iptables on these hosts. So there could be remnants. I a... 
[18:15:03] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2069.codfw.wmnet with reason: host reimage [18:15:10] (03CR) 10Andrew Bogott: [C:03+2] modules/profile/files/puppet/bin: cleanup puppet SSL on CA server change [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [18:17:29] !log contint1003 - maintenance reboot [18:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:37] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2062.codfw.wmnet with OS trixie [18:19:17] 06SRE, 06collaboration-services: cumin2003 fails to connect to contint[12]003 - https://phabricator.wikimedia.org/T430510#12067500 (10Dzahn) Yea, so an `sudo nft flush ruleset` fixed it on contint2003. I am also doing `apt-get remove --purge nftables" on both and rebooted contint1003 in the process. [18:19:29] PROBLEM - Host contint1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:43] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2069.codfw.wmnet with reason: host reimage [18:21:06] RECOVERY - Host contint1003 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [18:21:20] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:21:43] expecting recovery in a moment [18:27:15] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on contint1003.wikimedia.org with reason: not active yet [18:27:35] !log contint2003 - maintenance reboot [18:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:42] PROBLEM - Host contint2003 is DOWN: PING CRITICAL - Packet loss = 100% [18:31:10] RECOVERY - Host contint2003 is UP: PING OK - Packet loss = 0%, RTA = 
31.64 ms [18:32:29] 06SRE, 06collaboration-services: cumin2003 fails to connect to contint[12]003 - https://phabricator.wikimedia.org/T430510#12067536 (10Dzahn) 05Open→03Resolved a:03Dzahn fixed! package removed with --purge and rebooted both hosts. `[cumin2003:~] $ sudo cumin contint*003* "uname -v" 2 hosts will be t... [18:33:58] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on contint2003.wikimedia.org with reason: not active yet [18:37:18] jouncebot: nowandnext [18:37:18] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [18:37:18] In 1 hour(s) and 22 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T2000) [18:39:02] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:39:56] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:42:19] (03CR) 10Zabe: [C:03+2] Remove remaining occurences of apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305999 (https://phabricator.wikimedia.org/T418494) (owner: 10Zabe) [18:42:36] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2069.codfw.wmnet with OS trixie [18:42:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [18:43:20] (03Merged) 10jenkins-bot: Remove remaining occurences of apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305999 (https://phabricator.wikimedia.org/T418494) (owner: 10Zabe) [18:43:28] (03PS1) 10Dzahn: phabricator: replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) [18:43:47] !log 
zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1305999|Remove remaining occurences of apiportalwiki (T418494)]] [18:43:48] (03PS2) 10Ssingh: images/haproxy: set owner to Traffic [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1303420 [18:43:51] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [18:45:40] !log zabe@deploy1003 zabe: Backport for [[gerrit:1305999|Remove remaining occurences of apiportalwiki (T418494)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:45:45] (03CR) 10Ssingh: [V:03+2 C:03+2] images/haproxy: set owner to Traffic [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1303420 (owner: 10Ssingh) [18:47:44] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [18:48:34] !log zabe@deploy1003 zabe: Continuing with deployment [18:49:43] (03CR) 10Dzahn: "puppet compilers seem broken right now" [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [18:51:16] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [18:52:58] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305999|Remove remaining occurences of apiportalwiki (T418494)]] (duration: 09m 11s) [18:53:03] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [18:54:44] 06SRE, 06collaboration-services: cumin2003 fails to connect to contint[12]003 - https://phabricator.wikimedia.org/T430510#12067628 (10MoritzMuehlenhoff) >>! 
In T430510#12067485, @Dzahn wrote: > What might be special here is that these were first nftables and then I was reminded we needed to switch back to... [18:55:29] !incidents [18:55:30] 8104 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [18:55:58] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2075.codfw.wmnet with OS trixie [18:58:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [18:59:42] (03CR) 10Ayounsi: [C:03+1] Treat HE as transit rather than regular peer over NL-IX and AMS-IX [homer/public] - 10https://gerrit.wikimedia.org/r/1306374 (owner: 10Cathal Mooney) [19:01:58] (03CR) 10Cathal Mooney: [C:03+2] Treat HE as transit rather than regular peer over NL-IX and AMS-IX [homer/public] - 10https://gerrit.wikimedia.org/r/1306374 (owner: 10Cathal Mooney) [19:03:31] (03Merged) 10jenkins-bot: Treat HE as transit rather than regular peer over NL-IX and AMS-IX [homer/public] - 10https://gerrit.wikimedia.org/r/1306374 (owner: 10Cathal Mooney) [19:03:44] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:07:41] (03CR) 10Bartosz Dziewoński: [C:03+1] Remove security-related log hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [19:11:37] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [19:12:05] !log tchin@deploy1003 helmfile [staging] DONE 
helmfile.d/services/eventgate-analytics: apply [19:13:04] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2090.codfw.wmnet with OS trixie [19:13:31] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2075.codfw.wmnet with reason: host reimage [19:13:32] (03PS1) 10Ladsgroup: varnish: Apply webp transformation more aggressively [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) [19:14:41] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [19:14:49] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [19:15:03] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [19:15:45] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1306390/8803/" [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:15:53] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [19:15:56] (03CR) 10Dzahn: [V:03+1] "doing a part of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1305985" [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:16:20] (03CR) 10Dzahn: [V:03+1 C:03+2] phabricator: replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:16:28] (03PS2) 10Dzahn: phabricator: replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) [19:16:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold 
[19:16:50] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [19:17:38] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [19:18:44] (03CR) 10Dzahn: [C:03+2] phabricator: replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:19:07] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2075.codfw.wmnet with reason: host reimage [19:19:23] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [19:19:35] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [19:20:13] (03PS1) 10Andrew Bogott: Add profile::wmcs::daily_file_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1306398 (https://phabricator.wikimedia.org/T429578) [19:20:46] (03CR) 10CI reject: [V:04-1] Add profile::wmcs::daily_file_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1306398 (https://phabricator.wikimedia.org/T429578) (owner: 10Andrew Bogott) [19:21:17] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [19:21:44] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:22:01] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [19:22:28] (03CR) 10Dzahn: [C:03+2] "confirmed on phab1004/phab1005/phab2003 this changed nothing" [puppet] - 10https://gerrit.wikimedia.org/r/1306390 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:22:33] (03PS2) 10Andrew Bogott: Add profile::wmcs::daily_file_cleanup [puppet] - 
10https://gerrit.wikimedia.org/r/1306398 (https://phabricator.wikimedia.org/T429578) [19:24:21] (03PS1) 10Dzahn: gerrit: remove legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306399 (https://phabricator.wikimedia.org/T372666) [19:27:18] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [19:27:47] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [19:29:43] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1306399/8804/" [puppet] - 10https://gerrit.wikimedia.org/r/1306399 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:30:22] (03CR) 10Dzahn: [V:03+1 C:03+2] gerrit: remove legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306399 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:31:45] (03CR) 10Dzahn: "started to do some of these in separate patches - phab and gerrit are done - rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/1305985 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [19:31:50] (03PS2) 10JHathaway: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1305985 (https://phabricator.wikimedia.org/T372666) [19:32:45] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2090.codfw.wmnet with reason: host reimage [19:32:58] (03CR) 10Dzahn: "and yep, have been replacing legacy facts before in a bunch of places, these are the ones I had not caught yet. 
just that merging one gian" [puppet] - 10https://gerrit.wikimedia.org/r/1305985 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [19:33:38] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [19:33:48] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [19:34:24] (03PS1) 10Arlolra: Temporarily disable experimental ExtTagPFragment type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306401 (https://phabricator.wikimedia.org/T430344) [19:34:29] bking@cumin2003 reimage (PID 1516115) is awaiting input [19:35:57] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [19:36:14] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [19:37:22] (03CR) 10Dzahn: [V:03+1 C:03+2] "confirmed on gerrit1003/2002/2003 that this changed nothing" [puppet] - 10https://gerrit.wikimedia.org/r/1306399 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [19:38:06] (03CR) 10Subramanya Sastry: [C:03+1] Temporarily disable experimental ExtTagPFragment type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306401 (https://phabricator.wikimedia.org/T430344) (owner: 10Arlolra) [19:38:37] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [19:38:49] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [19:39:02] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2075.codfw.wmnet with OS trixie [19:39:10] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2090.codfw.wmnet with reason: host reimage [19:40:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306401 
(https://phabricator.wikimedia.org/T430344) (owner: 10Arlolra) [19:43:38] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12067770 (10KFrancis) Hi all, the NDA has been sent for signatures. Thanks! [19:43:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:45:26] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [19:48:45] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:49:35] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [19:51:13] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [19:51:19] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [19:51:37] 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#12067811 (10Jclark-ctr) updated and added an-worker1231 Drive has been swapped [19:51:44] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:51:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Degraded RAID on an-worker1231 - 
https://phabricator.wikimedia.org/T430219#12067814 (10Jclark-ctr) Replaced drive and updated T430219 [19:52:08] 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#12067822 (10Jclark-ctr) [19:52:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Degraded RAID on an-worker1231 - https://phabricator.wikimedia.org/T430219#12067829 (10Jclark-ctr) 05Open→03Resolved [19:54:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:54:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:55:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:55:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:56:44] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:57:53] (03CR) 10BCornwall: [V:03+2 C:03+1] "Thanks for the great commit message!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) (owner: 10Ladsgroup) [19:58:38] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2090.codfw.wmnet with OS trixie [19:59:26] (03CR) 10JHathaway: "For sure, please break it up however you see fit, happy to help" [puppet] - 10https://gerrit.wikimedia.org/r/1305985 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [19:59:50] jhancock@cumin2002 reimage (PID 3172463) is awaiting input [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T2000). [20:00:05] VadymTS1, arlolra, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:22] o/ [20:00:44] o/ [20:01:01] Looks like I'm the only one with a non-config patch. I don't mind deploying mine myself. [20:02:17] Is VadymTS1 around? [20:02:24] Looks like that patch was already deployed [20:02:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:03:01] Yeah, looks like they got that this morning. [20:03:14] Ok, I'll do my patches then [20:03:19] (03CR) 10JHathaway: "happy to break it up per module, or if you want to do the slicing dicing, be my guest!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1305988 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [20:03:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:04:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: 10C. Scott Ananian) [20:04:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306401 (https://phabricator.wikimedia.org/T430344) (owner: 10Arlolra) [20:05:07] (03CR) 10Arlolra: [C:04-2] Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: 10C. 
Scott Ananian) [20:05:12] (03CR) 10Arlolra: [C:04-2] Temporarily disable experimental ExtTagPFragment type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306401 (https://phabricator.wikimedia.org/T430344) (owner: 10Arlolra) [20:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:38] (03CR) 10Krinkle: varnish: Apply webp transformation more aggressively (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) (owner: 10Ladsgroup) [20:07:46] (03CR) 10Krinkle: [C:04-1] varnish: Apply webp transformation more aggressively [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) (owner: 10Ladsgroup) [20:08:20] (03CR) 10Arlolra: Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: 10C. Scott Ananian) [20:08:26] (03CR) 10Arlolra: Temporarily disable experimental ExtTagPFragment type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306401 (https://phabricator.wikimedia.org/T430344) (owner: 10Arlolra) [20:08:40] (03Merged) 10jenkins-bot: Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: 10C. 
Scott Ananian) [20:08:44] (03Merged) 10jenkins-bot: Temporarily disable experimental ExtTagPFragment type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306401 (https://phabricator.wikimedia.org/T430344) (owner: 10Arlolra) [20:09:17] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1305724|Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic (T430194)]], [[gerrit:1306401|Temporarily disable experimental ExtTagPFragment type (T430344 T429624)]] [20:09:25] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194 [20:09:25] T430344: Possible Parsoid bug with titles - https://phabricator.wikimedia.org/T430344 [20:09:26] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [20:11:12] !log arlolra@deploy1003 arlolra, cscott: Backport for [[gerrit:1305724|Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic (T430194)]], [[gerrit:1306401|Temporarily disable experimental ExtTagPFragment type (T430344 T429624)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:12:45] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:16:44] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:16:57] (03PS36) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:16:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2039.codfw.wmnet with OS bullseye [20:17:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#12067921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host restbase2039.codfw.wmnet with OS bullseye [20:18:16] !log arlolra@deploy1003 arlolra, cscott: Continuing with deployment [20:20:48] (03PS37) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:21:44] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:22:33] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305724|Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic (T430194)]], [[gerrit:1306401|Temporarily disable 
experimental ExtTagPFragment type (T430344 T429624)]] (duration: 13m 16s) [20:22:41] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194 [20:22:42] T430344: Possible Parsoid bug with titles - https://phabricator.wikimedia.org/T430344 [20:22:42] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [20:22:54] Kemayo: all yours [20:23:04] arlolra: Thanks! [20:23:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306358 (https://phabricator.wikimedia.org/T430450) (owner: 10DLynch) [20:23:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306361 (https://phabricator.wikimedia.org/T430450) (owner: 10DLynch) [20:23:44] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:25:09] (03Merged) 10jenkins-bot: Add missing resolveUrlOrTitle helper function [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306358 (https://phabricator.wikimedia.org/T430450) (owner: 10DLynch) [20:28:44] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:31:22] (03Merged) 10jenkins-bot: Add missing visualeditor-suggestion-link message to extension.json [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 
10https://gerrit.wikimedia.org/r/1306361 (https://phabricator.wikimedia.org/T430450) (owner: 10DLynch) [20:31:41] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1306358|Add missing resolveUrlOrTitle helper function (T430450)]], [[gerrit:1306361|Add missing visualeditor-suggestion-link message to extension.json (T430450)]] [20:31:46] T430450: Uncaught TypeError: mw.libs.ve.resolveUrlOrTitle is not a function - https://phabricator.wikimedia.org/T430450 [20:33:31] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1306358|Add missing resolveUrlOrTitle helper function (T430450)]], [[gerrit:1306361|Add missing visualeditor-suggestion-link message to extension.json (T430450)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:35:00] !log kemayo@deploy1003 kemayo: Continuing with deployment [20:36:19] (03PS1) 10DLynch: SuggestedLinkEditCheck: by default import the config of the growth task [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306409 (https://phabricator.wikimedia.org/T422730) [20:37:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:39:08] (03CR) 10Andrew Bogott: "First thoughts:" [puppet] - 10https://gerrit.wikimedia.org/r/1294864 (https://phabricator.wikimedia.org/T423549) (owner: 10Komla Sapaty) [20:39:19] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306358|Add missing resolveUrlOrTitle helper function (T430450)]], [[gerrit:1306361|Add missing visualeditor-suggestion-link message to extension.json (T430450)]] (duration: 07m 38s) [20:39:25] T430450: Uncaught TypeError: mw.libs.ve.resolveUrlOrTitle is not a function - 
https://phabricator.wikimedia.org/T430450 [20:40:06] (03PS3) 10DLynch: SuggestedLinkEditCheck: make non-experimental [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306410 (https://phabricator.wikimedia.org/T421968) [20:41:46] (03PS1) 10Ebernhardson: Revert^3 "cirrus: AB test query suggester variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306414 (https://phabricator.wikimedia.org/T407432) [20:42:46] (03PS4) 10DLynch: SuggestedLinkEditCheck: make non-experimental [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306410 (https://phabricator.wikimedia.org/T421968) [20:44:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#12068070 (10Jhancock.wm) @Eevans hey was waiting on a patch for the supermicro servers and overlooked something. I don't see an efi file in preseed.yaml for this server. C... [20:49:22] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS trixie [20:51:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306409 (https://phabricator.wikimedia.org/T422730) (owner: 10DLynch) [20:51:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306410 (https://phabricator.wikimedia.org/T421968) (owner: 10DLynch) [20:53:23] (03Merged) 10jenkins-bot: SuggestedLinkEditCheck: by default import the config of the growth task [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306409 (https://phabricator.wikimedia.org/T422730) (owner: 10DLynch) [20:53:26] (03Merged) 10jenkins-bot: SuggestedLinkEditCheck: make non-experimental [extensions/VisualEditor] (wmf/1.47.0-wmf.8) - 
10https://gerrit.wikimedia.org/r/1306410 (https://phabricator.wikimedia.org/T421968) (owner: 10DLynch) [20:53:45] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1306409|SuggestedLinkEditCheck: by default import the config of the growth task (T422730)]], [[gerrit:1306410|SuggestedLinkEditCheck: make non-experimental (T421968)]] [20:53:53] T422730: Develop an API that enables Edit Checks/Suggestions to be kept in sync with Growth Experiments - https://phabricator.wikimedia.org/T422730 [20:53:53] T421968: [Suggestion] Deploy "Add a link" suggestion as a default-on suggestion - https://phabricator.wikimedia.org/T421968 [20:55:36] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1306409|SuggestedLinkEditCheck: by default import the config of the growth task (T422730)]], [[gerrit:1306410|SuggestedLinkEditCheck: make non-experimental (T421968)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:55:43] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2073.codfw.wmnet with OS trixie [20:58:36] (03CR) 10CDobbins: "cp2044: 0 tests failed, 0 tests skipped, 20 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:58:45] (03CR) 10CDobbins: varnish: Add CSP report-only header value (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:59:47] !log kemayo@deploy1003 kemayo: Continuing with deployment [21:00:05] alexsanford, Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T2100). [21:03:30] hi! 
preparing to do a security deploy [21:04:10] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306409|SuggestedLinkEditCheck: by default import the config of the growth task (T422730)]], [[gerrit:1306410|SuggestedLinkEditCheck: make non-experimental (T421968)]] (duration: 10m 25s) [21:04:16] T422730: Develop an API that enables Edit Checks/Suggestions to be kept in sync with Growth Experiments - https://phabricator.wikimedia.org/T422730 [21:04:16] T421968: [Suggestion] Deploy "Add a link" suggestion as a default-on suggestion - https://phabricator.wikimedia.org/T421968 [21:07:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:08:13] looks like some backports just wrapped up? [21:08:30] Kemayo are you done? 
[21:08:43] maryum: Yup [21:08:51] lovely [21:09:16] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage [21:11:07] (03PS5) 10Dzahn: jenkins: have systemd monitoring based on mask status [puppet] - 10https://gerrit.wikimedia.org/r/1305951 (https://phabricator.wikimedia.org/T430114) [21:11:41] (03CR) 10Dzahn: "not what I originally had in mind but useful regardless, clarified commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1305951 (https://phabricator.wikimedia.org/T430114) (owner: 10Dzahn) [21:12:43] !log bking@cumin2003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2089.codfw.wmnet'] [21:12:59] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2073.codfw.wmnet with reason: host reimage [21:13:50] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage [21:14:16] (03CR) 10Dzahn: [C:03+2] jenkins: have systemd monitoring based on mask status [puppet] - 10https://gerrit.wikimedia.org/r/1305951 (https://phabricator.wikimedia.org/T430114) (owner: 10Dzahn) [21:14:46] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T430594 (10Rscout) 03NEW [21:16:19] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T430594#12068145 (10Rscout) FYI and your approval, @Rsilvola :) [21:16:42] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T430594#12068146 (10Rscout) @Rscout [21:17:24] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2073.codfw.wmnet with reason: host reimage [21:19:53] (03CR) 10Btullis: "It's a very good question. 
I suspect that what we will do is keep pulling in upstream changes, but keep applying our own patches where ups" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305973 (https://phabricator.wikimedia.org/T429331) (owner: 10Btullis) [21:21:25] !log Deployed security fix for T430548 [21:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:17] !log bking@cumin2003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2089.codfw.wmnet'] [21:28:31] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:39] bking@cumin2003 reimage (PID 1534875) is awaiting input [21:29:44] I have another patch to deploy, working on a small issue [21:30:00] !log bking@cumin2003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS trixie [21:30:46] 06SRE, 10CFSSL-PKI, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 07Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543#12068190 (10jhathaway) while investigating whether this task could be resolved, I found... We seem to be checking the CRL, based on the agent l... 
[21:30:47] 06SRE, 10SRE-Access-Requests: Requesting access to production for RScout-WMF - https://phabricator.wikimedia.org/T430594#12068191 (10Rscout) [21:31:44] 06SRE, 10SRE-Access-Requests: Requesting access to production for rscout - https://phabricator.wikimedia.org/T430594#12068196 (10Rscout) [21:33:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:08] (03CR) 10Btullis: [C:03+2] matomo: Enable the CustomDimensions plugin [puppet] - 10https://gerrit.wikimedia.org/r/1305903 (https://phabricator.wikimedia.org/T430307) (owner: 10Btullis) [21:48:21] issue resolved, continuing to deploy a second and last patch [21:49:43] second patch not going out today more likely on Thursday [21:49:48] security deploy is finished for today [21:57:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:57:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:00:06] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: OOB IPV6 down - https://phabricator.wikimedia.org/T430599 (10Papaul) 03NEW [22:00:14] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: OOB IPV6 down - https://phabricator.wikimedia.org/T430599#12068267 (10Papaul) p:05Triage→03Medium [22:02:40] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:02:40] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:10:02] (03CR) 10Ladsgroup: varnish: Apply webp transformation more aggressively (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) (owner: 10Ladsgroup) [22:15:33] (03PS2) 10Ladsgroup: varnish: Apply webp transformation more aggressively [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) [22:19:54] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12068303 (10Dzahn) [22:20:16] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12068305 (10Dzahn) @thcipriani hello, please take a look [22:20:58] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12068308 (10Dzahn) 05Open→03In progress [22:23:05] (03Abandoned) 10Btullis: wdqs: Add config for net-new wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1289428 (https://phabricator.wikimedia.org/T423314) (owner: 10Bking) [22:23:47] (03Abandoned) 10Btullis: ceph-rbd: Bump the ceph-csi plugin image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081112 (https://phabricator.wikimedia.org/T376401) (owner: 10Btullis) [22:24:16] (03Abandoned) 10Btullis: Deploy the updated ceph-csi container plugin to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134718 (https://phabricator.wikimedia.org/T389184) (owner: 10Btullis) [22:31:06] (03CR) 10Dzahn: [C:03+2] zuul: disable nodepool fallback [puppet] - 10https://gerrit.wikimedia.org/r/1306122 (https://phabricator.wikimedia.org/T424879) (owner: 10Hashar) [22:33:52] 
(03CR) 10Dzahn: [C:03+2] phabricator: drop differential.allow-self-accept config [puppet] - 10https://gerrit.wikimedia.org/r/1305041 (https://phabricator.wikimedia.org/T330797) (owner: 10Aklapper) [22:46:31] 06SRE, 10Wikimedia-Mailing-lists: Some req staff not receiving emails to wmfreqs@lists.wikimedia.org - https://phabricator.wikimedia.org/T430308#12068361 (10Dzahn) You can reach the owners of this list at: `wmfreqs-owner@lists.wikimedia.org`. Please try this first to see who is actually maintaining it and how... [22:47:48] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): dse-k8s-codfw istiogateway misconfiguration - https://phabricator.wikimedia.org/T430504#12068365 (10BTullis) 05Open→03Resolved Here is the investigation and corrective action that @atsuko and I carried out. It's a duplicate of T430484#12064554 > We... [22:51:47] 06SRE, 10Wikimedia-Mailing-lists: Some req staff not receiving emails to wmfreqs@lists.wikimedia.org - https://phabricator.wikimedia.org/T430308#12068383 (10Dzahn) btw, you can also suggest users to be added to that list via this page: https://lists.wikimedia.org/postorius/lists/wmfreqs.lists.wikimedia.org/... [22:54:13] 06SRE, 10Wikimedia-Mailing-lists: Some req staff not receiving emails to wmfreqs@lists.wikimedia.org - https://phabricator.wikimedia.org/T430308#12068389 (10Ladsgroup) I think the more likely cause was some sort of outage or issue that MTA started dropping emails. Since people mentioned not getting emails in w... [22:54:23] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Our World in Data gadget updates - https://phabricator.wikimedia.org/T429131#12068391 (10Dzahn) The hold-up here is that this is supposed to be done by rotating clinic duty. 
[22:56:33] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Our World in Data gadget updates - https://phabricator.wikimedia.org/T429131#12068393 (10Dzahn) The other one is that there are sometimes concerns about names not fitting in with https://meta.wikimedia.org/wiki/Mailing_lists/Standardization But it says "Othe... [23:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260629T2300) [23:00:33] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Our World in Data gadget updates - https://phabricator.wikimedia.org/T429131#12068399 (10Dzahn) 05Open→03Resolved a:03Dzahn @Doc_James You should have received email with an initial password that tells you this list has been created. I set the 2 adm... [23:08:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:21] (03CR) 10BCornwall: [C:03+1] varnish: Apply webp transformation more aggressively [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) (owner: 10Ladsgroup) [23:13:44] (03CR) 10Ladsgroup: "puppet on all of upload cluster is now disabled." 
[puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) (owner: 10Ladsgroup) [23:13:49] (03PS3) 10Ladsgroup: varnish: Apply webp transformation more aggressively [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) [23:13:54] (03CR) 10Ladsgroup: [V:03+2 C:03+2] varnish: Apply webp transformation more aggressively [puppet] - 10https://gerrit.wikimedia.org/r/1306395 (https://phabricator.wikimedia.org/T27611) (owner: 10Ladsgroup) [23:14:13] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Our World in Data gadget updates - https://phabricator.wikimedia.org/T429131#12068447 (10Doc_James) Thanks [23:15:43] Running this on cp3075 since it's low time on esams [23:15:49] ack [23:20:03] puppet agent looks correct and all. the varnish-frontend systemd service hasn't been restarted but I guess since it just reloads the config [23:23:09] correct [23:23:17] that's handled with vcl-reload [23:23:23] (which is automatically run) [23:23:47] (03PS1) 10Btullis: datahub: pin the production release to chart 0.0.81 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306440 (https://phabricator.wikimedia.org/T402408) [23:23:50] (03PS1) 10Btullis: datahub: upgrade chart to DataHub 1.6.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306441 (https://phabricator.wikimedia.org/T402408) [23:23:52] (03PS1) 10Btullis: datahub-next: use DataHub 1.6.0 images for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306442 (https://phabricator.wikimedia.org/T402408) [23:26:40] brett: going all in now [23:26:47] ack [23:32:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:37:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment 
for rscout - https://phabricator.wikimedia.org/T430594#12068475 (10thcipriani) >>! In T430594#12068303, @Dzahn wrote: > @thcipriani hello, please take a look Approved. @Rscout if you need access to our web deploy tool (as opposed to only doing depl... [23:37:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:14] !log bking@localhost raise the number of incoming shard recoveries from 4 to 7 on all search_codfw endpoints T429844 [23:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:18] T429844: Migrate production OpenSearch clusters from 1.x-2.x - CODFW - https://phabricator.wikimedia.org/T429844 [23:38:20] sigh, thumbor can't take the load but it should recover [23:38:39] already recovering [23:40:34] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12068484 (10BCornwall) What is the duration of the maintenance? 
[23:42:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306443 [23:42:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306443 (owner: 10TrainBranchBot) [23:44:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:45:10] Amir1: this you? [23:45:28] one sec [23:49:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:50:06] (03CR) 10BryanDavis: [C:03+1] "LGTM. Adding mutante as someone who can help work out how to get consensus for merging this and do the needful once that is established. L" [puppet] - 10https://gerrit.wikimedia.org/r/1306161 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [23:50:52] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306443 (owner: 10TrainBranchBot) [23:56:24] (03CR) 10Dzahn: [C:03+1] proxy: Allow outbount HTTPS connections to port 25000 [puppet] - 10https://gerrit.wikimedia.org/r/1306161 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [23:57:19] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12068510 (10Dzahn)