[00:00:56] !log T429844 [opensearch] chi cluster recovered after stopping `opensearch_1@production-search-codfw` on `cirrussearch2111`
[00:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:51] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2111.codfw.wmnet with OS trixie
[00:22:18] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2111.codfw.wmnet with reason: host reimage
[00:22:24] I'm unsure why we never saw IRC recoveries for `PROBLEM - OpenSearch health check` because the cluster is healthy again and when I glance at icinga it looks happy
[00:29:23] !log ryankemper@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search-omega,name=codfw
[00:29:24] !log ryankemper@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search-psi,name=codfw
[00:29:26] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2111.codfw.wmnet with reason: host reimage
[00:29:39] !log T429844 [opensearch] depooled codfw search-omega/search-psi discovery records to match existing codfw search depool during OpenSearch 2.19 migration
[00:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:42] T429844: Migrate production OpenSearch clusters from 1.x-2.x - CODFW - https://phabricator.wikimedia.org/T429844
[00:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:56:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:57:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:57:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2111.codfw.wmnet with OS trixie
[01:12:30] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306998
[01:12:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306998 (owner: TrainBranchBot)
[01:16:38] !log T429844 [opensearch] completed `cirrussearch2111` reimage; all codfw search clusters are green, all nodes now report `OpenSearch 2.19.5`, and the temporary chi voting exclusion has been removed
[01:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:42] T429844: Migrate production OpenSearch clusters from 1.x-2.x - CODFW - https://phabricator.wikimedia.org/T429844
[01:20:46] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306998 (owner: TrainBranchBot)
[02:00:50] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:50] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 59s)
[02:09:43] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:43] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:20] ops-codfw, SRE, Data-Persistence, Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12078721 (Papaul) Open→Resolved Complete
[02:34:32] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12078723 (Papaul)
[02:35:18] (CR) BCornwall: [V:+1 C:+1] "Tests pass, and this looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[02:35:59] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: rack A8 maintenance 2026-07-01 10:00 am CT - https://phabricator.wikimedia.org/T429856#12078724 (Papaul) Open→Resolved Complete
[02:36:29] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12078726 (Papaul)
[02:37:27] (CR) BCornwall: [C:+1] cache::haproxy: add correlation id feature [puppet] - https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) (owner: Fabfur)
[03:06:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:27:57] (PS2) Reedy: CommonSettings: Set $wgScoreUseSvg = true [mediawiki-config] - https://gerrit.wikimedia.org/r/1298928 (https://phabricator.wikimedia.org/T49578)
[04:29:52] (CR) Tim Starling: [C:+2] "I can test it with WikimediaDebug during scap." [mediawiki-config] - https://gerrit.wikimedia.org/r/1298928 (https://phabricator.wikimedia.org/T49578) (owner: Reedy)
[04:31:00] (Merged) jenkins-bot: CommonSettings: Set $wgScoreUseSvg = true [mediawiki-config] - https://gerrit.wikimedia.org/r/1298928 (https://phabricator.wikimedia.org/T49578) (owner: Reedy)
[04:36:24] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1298928|CommonSettings: Set $wgScoreUseSvg = true (T49578)]]
[04:36:27] T49578: Score should output SVG - https://phabricator.wikimedia.org/T49578
[04:38:46] !log tstarling@deploy1003 tstarling, reedy: Backport for [[gerrit:1298928|CommonSettings: Set $wgScoreUseSvg = true (T49578)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[04:41:08] !log tstarling@deploy1003 tstarling, reedy: Continuing with deployment
[04:45:32] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298928|CommonSettings: Set $wgScoreUseSvg = true (T49578)]] (duration: 09m 08s)
[04:45:35] T49578: Score should output SVG - https://phabricator.wikimedia.org/T49578
[04:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:56:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:57:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:14:02] (PS1) Giuseppe Lavagetto: hiddenparma: add known fingerprints file [puppet] - https://gerrit.wikimedia.org/r/1307013
[05:16:55] !incidents
[05:16:56] 8120 (ACKED, 25h 41m old) es1039 (paged)/MariaDB Replica SQL: es7 (paged)
[05:16:56] 8121 (ACKED, 25h 41m old) es1039 (paged)/MariaDB Replica IO: es7 (paged)
[05:16:56] 8123 (ACKED) [3x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad)
[05:16:56] 8134 (ACKED) Host 10.3.0.1
[05:16:56] 8135 (ACKED) Host cloudelastic.wikimedia.org
[05:16:57] 8144 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw)
[05:16:57] 8143 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out)
[05:16:57] 8127 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule@main)
[05:16:58] 8126 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule@main)
[05:16:58] 8124 (RESOLVED) [50x] ProbeDown sre ()
[05:16:59] 8141 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out)
[05:19:15] !resolve 8120
[05:19:15] 8120 (RESOLVED, 25h 43m old) es1039 (paged)/MariaDB Replica SQL: es7 (paged)
[05:19:19] !resolve 8121
[05:19:19] 8121 (RESOLVED, 25h 43m old) es1039 (paged)/MariaDB Replica IO: es7 (paged)
[05:19:21] !ack
[05:19:21] All incidents are already acked.
[05:19:30] !incidents
[05:19:30] 8123 (ACKED) [3x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad)
[05:19:31] 8134 (ACKED) Host 10.3.0.1
[05:19:31] 8135 (ACKED) Host cloudelastic.wikimedia.org
[05:19:31] 8144 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw)
[05:19:31] 8143 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out)
[05:19:31] 8127 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule@main)
[05:19:32] 8126 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule@main)
[05:19:32] 8124 (RESOLVED) [50x] ProbeDown sre ()
[05:19:33] 8141 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out)
[05:24:38] (PS1) Marostegui: dbproxy1026,dbproxy1028: Test db1228 on proxies [puppet] - https://gerrit.wikimedia.org/r/1307014 (https://phabricator.wikimedia.org/T430158)
[05:28:17] (PS2) Marostegui: dbproxy1026,dbproxy1028: Test db1228 on proxies [puppet] - https://gerrit.wikimedia.org/r/1307014 (https://phabricator.wikimedia.org/T430158)
[05:29:02] (CR) Marostegui: [C:+2] dbproxy1026,dbproxy1028: Test db1228 on proxies [puppet] - https://gerrit.wikimedia.org/r/1307014 (https://phabricator.wikimedia.org/T430158) (owner: Marostegui)
[05:31:38] (PS1) Marostegui: Revert "dbproxy1026,dbproxy1028: Test db1228 on proxies" [puppet] - https://gerrit.wikimedia.org/r/1307015
[05:32:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet,db[1217,1228,1250].eqiad.wmnet with reason: m3 master switchover T430158
[05:32:07] T430158: Switchover m3 (phabricator) master (db1250 -> db1228) - https://phabricator.wikimedia.org/T430158
[05:32:26] (CR) Marostegui: [C:+2] Revert "dbproxy1026,dbproxy1028: Test db1228 on proxies" [puppet] - https://gerrit.wikimedia.org/r/1307015 (owner: Marostegui)
[05:35:35] (PS1) Marostegui: mariadb: Promote db1228 to m3 master [puppet] - https://gerrit.wikimedia.org/r/1307016 (https://phabricator.wikimedia.org/T430158)
[05:36:32] (CR) Marostegui: [C:+2] mariadb: Promote db1228 to m3 master [puppet] - https://gerrit.wikimedia.org/r/1307016 (https://phabricator.wikimedia.org/T430158) (owner: Marostegui)
[05:39:29] !log Failover m3 (phabricator) from db1250 to db1228 - T430158
[05:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:33] T430158: Switchover m3 (phabricator) master (db1250 -> db1228) - https://phabricator.wikimedia.org/T430158
[05:44:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1250.eqiad.wmnet with reason: m3 master switchover T430158
[05:44:36] (PS1) Marostegui: db1250: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1307025 (https://phabricator.wikimedia.org/T430106)
[05:45:17] (CR) Marostegui: [C:+2] db1250: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1307025 (https://phabricator.wikimedia.org/T430106) (owner: Marostegui)
[05:55:25] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1250.eqiad.wmnet with OS trixie
[05:58:59] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 40 hosts with reason: Primary switchover s4 T430817
[05:59:02] T430817: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T430817
[05:59:28] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Set db1160 with weight 0 T430817', diff saved to https://phabricator.wikimedia.org/P94673 and previous config saved to /var/cache/conftool/dbconfig/20260702-055927-cwilliams.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T0600)
[06:00:05] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T0600).
[06:04:50] !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet
[06:05:25] (CR) CWilliams: [C:+2] mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1306927 (https://phabricator.wikimedia.org/T430817) (owner: Gerrit maintenance bot)
[06:06:49] !log Starting s4 eqiad failover from db1244 to db1160 - T430817
[06:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:52] T430817: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T430817
[06:07:05] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T430817', diff saved to https://phabricator.wikimedia.org/P94674 and previous config saved to /var/cache/conftool/dbconfig/20260702-060704-cwilliams.json
[06:07:47] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T430817', diff saved to https://phabricator.wikimedia.org/P94675 and previous config saved to /var/cache/conftool/dbconfig/20260702-060746-cwilliams.json
[06:08:12] !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet
[06:09:37] !log cwilliams@dns1006 START - running authdns-update
[06:09:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1250.eqiad.wmnet with reason: host reimage
[06:09:55] (PS2) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1306928 (https://phabricator.wikimedia.org/T430817)
[06:11:00] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Depool db1244 T430817', diff saved to https://phabricator.wikimedia.org/P94676 and previous config saved to /var/cache/conftool/dbconfig/20260702-061059-cwilliams.json
[06:11:38] !log cwilliams@dns1006 END - running authdns-update
[06:11:43] (CR) CWilliams: [C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1306928 (https://phabricator.wikimedia.org/T430817) (owner: Gerrit maintenance bot)
[06:12:20] !log cwilliams@dns1006 START - running authdns-update
[06:14:30] !log cwilliams@dns1006 END - running authdns-update
[06:15:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1250.eqiad.wmnet with reason: host reimage
[06:22:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:25:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[06:25:36] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893
[06:25:40] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893
[06:25:45] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1244: Upgrading db1244.eqiad.wmnet
[06:25:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1244: Upgrading db1244.eqiad.wmnet
[06:28:55] cwilliams@cumin1003 major-upgrade (PID 851467) is awaiting input
[06:29:43] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:34:53] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1244.eqiad.wmnet with OS trixie
[06:38:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1250.eqiad.wmnet with OS trixie
[06:46:24] (PS1) Marostegui: backup1013.cnf.erb: Change es2 host [puppet] - https://gerrit.wikimedia.org/r/1307054 (https://phabricator.wikimedia.org/T408772)
[06:50:01] (PS1) Muehlenhoff: Record LDAP access for tlepage [puppet] - https://gerrit.wikimedia.org/r/1307055
[06:50:16] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1244.eqiad.wmnet with reason: host reimage
[06:52:03] (CR) Muehlenhoff: [C:+2] Record LDAP access for tlepage [puppet] - https://gerrit.wikimedia.org/r/1307055 (owner: Muehlenhoff)
[06:54:34] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1244.eqiad.wmnet with reason: host reimage
[07:00:05] Amir1, urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T0700). nyaa~
[07:00:05] WMDE-Fisch and cscott: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] \o I'll self serve and would just start
[07:01:14] (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [extensions/Cite] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1306970 (https://phabricator.wikimedia.org/T415904) (owner: WMDE-Fisch)
[07:01:14] (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [extensions/Cite] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306971 (https://phabricator.wikimedia.org/T415904) (owner: WMDE-Fisch)
[07:06:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:41] (Merged) jenkins-bot: Fix how to check the treatment group [extensions/Cite] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1306970 (https://phabricator.wikimedia.org/T415904) (owner: WMDE-Fisch)
[07:08:43] (Merged) jenkins-bot: Fix how to check the treatment group [extensions/Cite] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1306971 (https://phabricator.wikimedia.org/T415904) (owner: WMDE-Fisch)
[07:09:20] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1306970|Fix how to check the treatment group (T415904)]], [[gerrit:1306971|Fix how to check the treatment group (T415904)]]
[07:09:23] T415904: [Epic] Experiment Reader Footnote Click Intent - https://phabricator.wikimedia.org/T415904
[07:09:46] (CR) Arnaudb: [C:+2] backup: restrict gerrit-repo-data fileset to git and LFS [puppet] - https://gerrit.wikimedia.org/r/1306863 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[07:11:04] o/
[07:11:17] WMDE-Fisch: let me know when you're done?
[07:11:28] !log wmde-fisch@deploy1003 wmde-fisch: Backport for [[gerrit:1306970|Fix how to check the treatment group (T415904)]], [[gerrit:1306971|Fix how to check the treatment group (T415904)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:11:31] cscott: Sure
[07:11:53] !log wmde-fisch@deploy1003 wmde-fisch: Continuing with deployment
[07:13:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1244.eqiad.wmnet with OS trixie
[07:16:15] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306970|Fix how to check the treatment group (T415904)]], [[gerrit:1306971|Fix how to check the treatment group (T415904)]] (duration: 06m 55s)
[07:16:19] T415904: [Epic] Experiment Reader Footnote Click Intent - https://phabricator.wikimedia.org/T415904
[07:16:36] cscott: I'm done. Feel free to continue!
[07:16:47] thanks!
[07:18:29] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306985 (https://phabricator.wikimedia.org/T430194) (owner: Subramanya Sastry)
[07:22:33] (Merged) jenkins-bot: Parsoid read views: Bump enwiki NS_MAIN desktop traffic to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/1306985 (https://phabricator.wikimedia.org/T430194) (owner: Subramanya Sastry)
[07:22:57] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1306985|Parsoid read views: Bump enwiki NS_MAIN desktop traffic to 100% (T430194)]]
[07:23:00] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194
[07:23:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1244: Migration of db1244.eqiad.wmnet completed
[07:25:05] !log cscott@deploy1003 ssastry, cscott: Backport for [[gerrit:1306985|Parsoid read views: Bump enwiki NS_MAIN desktop traffic to 100% (T430194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:26:07] !log cscott@deploy1003 ssastry, cscott: Continuing with deployment
[07:26:39] (PS1) C. Scott Ananian: [REST] Don't language-convert non-parsoid output; don't lookup bogus titles [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307058 (https://phabricator.wikimedia.org/T430778)
[07:26:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307058 (https://phabricator.wikimedia.org/T430778) (owner: C. Scott Ananian)
[07:30:26] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306985|Parsoid read views: Bump enwiki NS_MAIN desktop traffic to 100% (T430194)]] (duration: 07m 28s)
[07:30:29] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194
[07:31:23] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307058 (https://phabricator.wikimedia.org/T430778) (owner: C. Scott Ananian)
[07:31:23] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1306996 (https://phabricator.wikimedia.org/T430344) (owner: C. Scott Ananian)
[07:38:22] (CR) Komla Sapaty: "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1294864 (https://phabricator.wikimedia.org/T423549) (owner: Komla Sapaty)
[07:39:06] (PS2) C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a14 [vendor] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307059 (https://phabricator.wikimedia.org/T387374)
[07:39:24] (PS4) Arnaudb: backup: exclude lucene index from Gerrit backups [puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744)
[07:39:45] (CR) CI reject: [V:-1] backup: exclude lucene index from Gerrit backups [puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744) (owner: Arnaudb)
[07:39:47] (Abandoned) Arnaudb: backup: exclude lucene index from Gerrit backups [puppet] - https://gerrit.wikimedia.org/r/1306166 (https://phabricator.wikimedia.org/T257744) (owner: Arnaudb)
[07:39:52] !log jmm@cumin2003 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader2005.wikimedia.org
[07:39:58] (Merged) jenkins-bot: [REST] Don't language-convert non-parsoid output; don't lookup bogus titles [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307058 (https://phabricator.wikimedia.org/T430778) (owner: C. Scott Ananian)
[07:40:14] SRE, Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#12079093 (ops-monitoring-bot) VM urldownloader2005.wikimedia.org rebooted by jmm@cumin2003 with reason: bump resources
[07:41:06] (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a14 [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307061 (https://phabricator.wikimedia.org/T430501)
[07:41:19] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307061 (https://phabricator.wikimedia.org/T430501) (owner: C. Scott Ananian)
[07:41:48] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [vendor] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1307059 (https://phabricator.wikimedia.org/T387374) (owner: C. Scott Ananian)
[07:42:16] (Merged) jenkins-bot: [parser] When expanding an extension tag with a title, use a new frame [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1306996 (https://phabricator.wikimedia.org/T430344) (owner: C. Scott Ananian)
[07:42:45] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1307058|[REST] Don't language-convert non-parsoid output; don't lookup bogus titles (T430778)]], [[gerrit:1306996|[parser] When expanding an extension tag with a title, use a new frame (T430344 T429624)]]
[07:42:53] T430778: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target. - https://phabricator.wikimedia.org/T430778
[07:42:53] T430344: Parsoid displays first used template as page title (instead of page title) - https://phabricator.wikimedia.org/T430344
[07:42:54] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624
[07:43:25] (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1306979 (https://phabricator.wikimedia.org/T423714) (owner: Kamila Součková)
[07:44:15] !log installing node-lodash security updates
[07:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:22] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader2005.wikimedia.org
[07:44:51] !log cscott@deploy1003 cscott: Backport for [[gerrit:1307058|[REST] Don't language-convert non-parsoid output; don't lookup bogus titles (T430778)]], [[gerrit:1306996|[parser] When expanding an extension tag with a title, use a new frame (T430344 T429624)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:49:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply
[07:51:27] FIRING: CertAlmostExpired: Certificate for service releases2003:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#releases2003:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:54:46] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:54:57] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:54:58] !log jmm@cumin2003 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader2006.wikimedia.org
[07:55:17] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[07:55:21] SRE, Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#12079161 (ops-monitoring-bot) VM urldownloader2006.wikimedia.org rebooted by jmm@cumin2003 with reason: bump resources
[07:55:27] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[07:55:50] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:56:01] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:56:13] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:56:46] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:57:24] !log cscott@deploy1003 cscott: Continuing with deployment
[07:58:08] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[07:59:27] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader2006.wikimedia.org
[07:59:32] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:59:42] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:59:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply
[08:00:02] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:00:05] andre and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T0800).
[08:00:05] o/
[08:00:07] * andre waiting until the backports finish
[08:00:09] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:00:43] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:00:57] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:01:19] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:01:30] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:01:43] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1307058|[REST] Don't language-convert non-parsoid output; don't lookup bogus titles (T430778)]], [[gerrit:1306996|[parser] When expanding an extension tag with a title, use a new frame (T430344 T429624)]] (duration: 18m 58s)
[08:01:49] T430778: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target. - https://phabricator.wikimedia.org/T430778
[08:01:50] T430344: Parsoid displays first used template as page title (instead of page title) - https://phabricator.wikimedia.org/T430344
[08:01:50] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624
[08:02:11] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[08:02:20] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[08:03:35] (03PS1) 10Majavah: P:diffscan: Do not scan v6 networks [puppet] - 10https://gerrit.wikimedia.org/r/1307065 (https://phabricator.wikimedia.org/T265329) [08:04:34] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8897/co" [puppet] - 10https://gerrit.wikimedia.org/r/1307065 (https://phabricator.wikimedia.org/T265329) (owner: 10Majavah) [08:04:37] cscott: Hi, are you finished with backports, or is there more? Asking as the train window has started. [08:04:55] just one more, if that's ok [08:05:13] cscott, sure, go ahead [08:05:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307059 (https://phabricator.wikimedia.org/T387374) (owner: 10C. Scott Ananian) [08:05:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307061 (https://phabricator.wikimedia.org/T430501) (owner: 10C. Scott Ananian) [08:05:40] (03CR) 10Filippo Giunchedi: [C:03+2] admin: add rscout to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1306500 (https://phabricator.wikimedia.org/T430594) (owner: 10Filippo Giunchedi) [08:05:45] (i did backport the fix for T430778, which was a train blocker) [08:06:22] (03PS1) 10Majavah: diffscan: Inline template [puppet] - 10https://gerrit.wikimedia.org/r/1307066 [08:07:08] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8898/co" [puppet] - 10https://gerrit.wikimedia.org/r/1307066 (owner: 10Majavah) [08:07:26] cscott: thank you! 
<3 [08:07:32] (03PS6) 10Btullis: topolvm: tighten controller RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305975 (https://phabricator.wikimedia.org/T429331) [08:07:32] (03PS6) 10Btullis: topolvm: scrape controller/node metrics via prometheus.io annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306222 (https://phabricator.wikimedia.org/T429331) [08:07:32] (03PS7) 10Btullis: topolvm-crds: add the TopoLVM CRD for version 0.38.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305976 (https://phabricator.wikimedia.org/T429331) [08:07:32] (03PS6) 10Btullis: admin_ng: define the topolvm CSI releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331) [08:07:33] (03PS7) 10Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305978 (https://phabricator.wikimedia.org/T429331) [08:07:43] (03PS2) 10Filippo Giunchedi: admin: add rscout to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1306500 (https://phabricator.wikimedia.org/T430594) [08:08:08] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] admin: add rscout to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1306500 (https://phabricator.wikimedia.org/T430594) (owner: 10Filippo Giunchedi) [08:08:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079217 (10ayounsi) [08:08:37] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1244: Migration of db1244.eqiad.wmnet completed [08:08:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [08:09:12] (03PS2) 10Majavah: P:diffscan: Do not scan v6 networks [puppet] - 10https://gerrit.wikimedia.org/r/1307065 (https://phabricator.wikimedia.org/T265329) [08:09:12] (03PS2) 10Majavah: diffscan: Inline template [puppet] - 
10https://gerrit.wikimedia.org/r/1307066 [08:09:16] (03PS4) 10Elukey: spicerack: add management/config.yaml structure [puppet] - 10https://gerrit.wikimedia.org/r/1306874 (https://phabricator.wikimedia.org/T429699) [08:10:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8899/co" [puppet] - 10https://gerrit.wikimedia.org/r/1307065 (https://phabricator.wikimedia.org/T265329) (owner: 10Majavah) [08:10:19] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Diffscan: investigate IPv6 support and explore other scanning tooling - https://phabricator.wikimedia.org/T265329#12079221 (10taavi) a:03taavi [08:10:57] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Diffscan: investigate IPv6 support and explore other scanning tooling - https://phabricator.wikimedia.org/T265329#12079222 (10taavi) a:05taavi→03None [08:10:58] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8900/co" [puppet] - 10https://gerrit.wikimedia.org/r/1307066 (owner: 10Majavah) [08:11:42] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12079227 (10fgiunchedi) [08:12:12] (03CR) 10Jcrespo: [C:03+1] "This is ok, but will need grant deploy, I will take care of it." [puppet] - 10https://gerrit.wikimedia.org/r/1307054 (https://phabricator.wikimedia.org/T408772) (owner: 10Marostegui) [08:12:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12079228 (10fgiunchedi) 05In progress→03Resolved a:03fgiunchedi All steps done, I'm tentatively resolving though @Rscout please reach out and reopen if something... 
[08:12:59] (03CR) 10Elukey: [C:03+2] spicerack: add management/config.yaml structure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1306874 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [08:13:20] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a14 [vendor] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307059 (https://phabricator.wikimedia.org/T387374) (owner: 10C. Scott Ananian) [08:13:30] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a14 [core] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307061 (https://phabricator.wikimedia.org/T430501) (owner: 10C. Scott Ananian) [08:13:59] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1307059|Bump wikimedia/parsoid to 0.24.0-a14 (T387374 T430186 T430367 T430501)]], [[gerrit:1307061|Bump wikimedia/parsoid to 0.24.0-a14 (T430501)]] [08:14:09] T387374: Compound templates prevent section edit links where legacy adds them - https://phabricator.wikimedia.org/T387374 [08:14:09] T430186: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at start of string - https://phabricator.wikimedia.org/T430186 [08:14:10] T430367: Parsoid adds section edit links to post-expansion where the preprocessor omitted them - https://phabricator.wikimedia.org/T430367 [08:14:10] T430501: CTT tasks week of 2026-06-26 - https://phabricator.wikimedia.org/T430501 [08:16:04] !log cscott@deploy1003 cscott: Backport for [[gerrit:1307059|Bump wikimedia/parsoid to 0.24.0-a14 (T387374 T430186 T430367 T430501)]], [[gerrit:1307061|Bump wikimedia/parsoid to 0.24.0-a14 (T430501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:21:27] !log cscott@deploy1003 cscott: Continuing with deployment [08:25:44] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1307059|Bump wikimedia/parsoid to 0.24.0-a14 (T387374 T430186 T430367 T430501)]], [[gerrit:1307061|Bump wikimedia/parsoid to 0.24.0-a14 (T430501)]] (duration: 11m 44s) [08:25:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079304 (10ayounsi) [08:25:52] T387374: Compound templates prevent section edit links where legacy adds them - https://phabricator.wikimedia.org/T387374 [08:25:52] T430186: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at start of string - https://phabricator.wikimedia.org/T430186 [08:25:53] T430367: Parsoid adds section edit links to post-expansion where the preprocessor omitted them - https://phabricator.wikimedia.org/T430367 [08:25:53] T430501: CTT tasks week of 2026-06-26 - https://phabricator.wikimedia.org/T430501 [08:26:09] andre: ok, i'm all done. thanks for your patience! [08:26:19] cscott: thanks a lot! no problem!~ [08:26:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079312 (10ayounsi) [08:27:49] (03CR) 10Hashar: "The Gitlab runners have a similar pattern 0f30707ef7c460702ae8e7f3cf4ead5e8056a72b which was introduced for Zuul. 
I have commented further" [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [08:28:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1307068 (https://phabricator.wikimedia.org/T430912) [08:33:30] (03PS1) 10TrainBranchBot: group2 to 1.47.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307069 (https://phabricator.wikimedia.org/T423918) [08:33:33] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307069 (https://phabricator.wikimedia.org/T423918) (owner: 10TrainBranchBot) [08:34:30] (03Merged) 10jenkins-bot: group2 to 1.47.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307069 (https://phabricator.wikimedia.org/T423918) (owner: 10TrainBranchBot) [08:34:47] (03PS1) 10Btullis: datahub-next: allow egress to the in-cluster OpenSearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307072 (https://phabricator.wikimedia.org/T402408) [08:34:49] (03PS1) 10Btullis: datahub-next: reach OpenSearch in plaintext via the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307073 (https://phabricator.wikimedia.org/T402408) [08:35:01] (03CR) 10Jcrespo: [C:03+1] "Actually, being a read-only host, we will keep it without, so no accidental backup happens. This can be merged now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1307054 (https://phabricator.wikimedia.org/T408772) (owner: 10Marostegui) [08:35:17] (03PS1) 10Federico Ceratto: admin: Rotate yubico pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1307070 [08:39:04] (03CR) 10Btullis: [C:03+2] datahub-next: allow egress to the in-cluster OpenSearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307072 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:39:39] (03PS7) 10Btullis: topolvm: tighten controller RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305975 (https://phabricator.wikimedia.org/T429331) [08:39:40] (03PS7) 10Btullis: topolvm: scrape controller/node metrics via prometheus.io annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306222 (https://phabricator.wikimedia.org/T429331) [08:39:40] (03PS8) 10Btullis: topolvm-crds: add the TopoLVM CRD for version 0.38.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305976 (https://phabricator.wikimedia.org/T429331) [08:39:40] (03PS7) 10Btullis: admin_ng: define the topolvm CSI releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331) [08:39:41] (03PS8) 10Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305978 (https://phabricator.wikimedia.org/T429331) [08:40:41] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.9 refs T423918 [08:40:44] T423918: 1.47.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T423918 [08:41:22] (03Merged) 10jenkins-bot: datahub-next: allow egress to the in-cluster OpenSearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307072 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:43:29] (03PS1) 10Kosta Harlan: build: Update required Node version from 24.14.1 to 24.18.0 [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/1307075 [08:43:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [08:43:44] (03PS1) 10Kosta Harlan: SourceEditorOverlay: Re-enable buttons after non-captcha save failure [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307076 (https://phabricator.wikimedia.org/T430518) [08:44:11] jouncebot: nowandnext [08:44:11] For the next 1 hour(s) and 15 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T0800) [08:44:11] In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1000) [08:44:37] andre: I have a bug fix for wmf.9 that I'd like to deploy when you're done with the train [08:45:29] kostajh: I have deployed the train, but I see some auth errors which make me a bit nervous [08:45:32] kostajh, But I need to dig a bit more before (potentially) rolling back, so go ahead, I'd say [08:46:09] ok [08:47:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307075 (owner: 10Kosta Harlan) [08:50:26] 06SRE, 06Infrastructure-Foundations: Block FUSE (kernel module/package) on hosts which don't need it - https://phabricator.wikimedia.org/T287753#12079413 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [08:51:11] (03CR) 10CI reject: [V:04-1] SourceEditorOverlay: Re-enable buttons after non-captcha save failure [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307076 (https://phabricator.wikimedia.org/T430518) (owner: 10Kosta Harlan) [08:52:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [08:52:38] !log fceratto@cumin1003 dbctl commit (dc=all): 
'Depooling db2220 (T426633)', diff saved to https://phabricator.wikimedia.org/P94681 and previous config saved to /var/cache/conftool/dbconfig/20260702-085237-fceratto.json [08:52:44] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [08:52:55] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [08:53:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079442 (10ayounsi) [08:54:07] (03Merged) 10jenkins-bot: build: Update required Node version from 24.14.1 to 24.18.0 [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307075 (owner: 10Kosta Harlan) [08:54:24] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1307075|build: Update required Node version from 24.14.1 to 24.18.0]] [08:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:55:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:56:17] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1307075|build: Update required Node version from 24.14.1 to 24.18.0]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:56:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:57:10] !log kharlan@deploy1003 kharlan: Continuing with deployment [08:57:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:59:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T426633)', diff saved to https://phabricator.wikimedia.org/P94682 and previous config saved to /var/cache/conftool/dbconfig/20260702-085942-fceratto.json [09:00:28] (03CR) 10Kosta Harlan: "recheck" [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307076 (https://phabricator.wikimedia.org/T430518) (owner: 10Kosta Harlan) [09:01:32] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1307075|build: Update required Node version from 24.14.1 to 24.18.0]] (duration: 07m 07s) [09:01:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307076 (https://phabricator.wikimedia.org/T430518) (owner: 10Kosta Harlan) [09:03:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [09:03:36] (03CR) 10Majavah: "Hmm, I thought we didn't need to add these for new hosts, only to preserve the old IDs for older hosts?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1306289 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [09:05:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:09:34] (03Merged) 10jenkins-bot: SourceEditorOverlay: Re-enable buttons after non-captcha save failure [extensions/MobileFrontend] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1307076 (https://phabricator.wikimedia.org/T430518) (owner: 10Kosta Harlan) [09:09:50] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1307076|SourceEditorOverlay: Re-enable buttons after non-captcha save failure (T430518)]] [09:09:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P94683 and previous config saved to /var/cache/conftool/dbconfig/20260702-090950-fceratto.json [09:09:53] T430518: MobileFrontend Source Editor: Cannot publish or go back after triggering AbuseFilter warning consequence - https://phabricator.wikimedia.org/T430518 [09:11:14] (03CR) 10Majavah: [C:03+1] team-wmcs: introduce per-namespace neutron conntrack alert [alerts] - 10https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [09:11:41] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1307076|SourceEditorOverlay: Re-enable buttons after non-captcha save failure (T430518)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:12:31] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:13:24] !log installing libgcrypt20 security updates [09:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:57] (03PS1) 10Elukey: profile::spicerack: fix management_config_data handling [puppet] - 10https://gerrit.wikimedia.org/r/1307079 (https://phabricator.wikimedia.org/T429699) [09:14:00] (03PS1) 10Elukey: profile::spicerack: remove unnecessary filter for empty values [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) [09:16:47] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1307076|SourceEditorOverlay: Re-enable buttons after non-captcha save failure (T430518)]] (duration: 06m 57s) [09:16:50] T430518: MobileFrontend Source Editor: Cannot publish or go back after triggering AbuseFilter warning consequence - https://phabricator.wikimedia.org/T430518 [09:17:22] andre: I'm done with the wmf.9 deployments [09:17:27] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1307081 (https://phabricator.wikimedia.org/T430920) [09:17:41] kostajh: thanks for the notice! 
[09:19:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P94684 and previous config saved to /var/cache/conftool/dbconfig/20260702-091957-fceratto.json [09:21:15] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1307083 (https://phabricator.wikimedia.org/T430923) [09:24:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T430923 [09:24:39] T430923: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T430923 [09:24:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2192 with weight 0 T430923', diff saved to https://phabricator.wikimedia.org/P94685 and previous config saved to /var/cache/conftool/dbconfig/20260702-092455-fceratto.json [09:26:23] (03CR) 10Jelto: "I'm a bit confused, the commit message is about `wikimedia.cloud.org`, the change adds an exception for `wikimediacloud.org` and the comme" [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [09:27:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079630 (10ayounsi) [09:27:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079635 (10ayounsi) [09:27:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079637 (10ayounsi) [09:29:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [09:30:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2220 (T426633)', diff saved to https://phabricator.wikimedia.org/P94686 and previous config saved to /var/cache/conftool/dbconfig/20260702-093004-fceratto.json [09:30:51] (03CR) 10Marostegui: [C:03+2] "thanks - btw I didn't find any dumps user on the previous host." [puppet] - 10https://gerrit.wikimedia.org/r/1307054 (https://phabricator.wikimedia.org/T408772) (owner: 10Marostegui) [09:31:58] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1307083 (https://phabricator.wikimedia.org/T430923) (owner: 10Gerrit maintenance bot) [09:33:02] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2241 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1307084 (https://phabricator.wikimedia.org/T430925) [09:34:52] (03CR) 10Filippo Giunchedi: [C:03+2] "My understanding is that we need the ID on reimage to preserve identity, both for new and old cloudvirts. Though I haven't checked what a " [puppet] - 10https://gerrit.wikimedia.org/r/1306289 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [09:34:56] (03CR) 10Jcrespo: [C:03+1] "Yes, I remembered after my initial comment that we don't have those deployed all the time, only when taking backups every 5 years to avoid" [puppet] - 10https://gerrit.wikimedia.org/r/1307054 (https://phabricator.wikimedia.org/T408772) (owner: 10Marostegui) [09:35:00] (03CR) 10Cathal Mooney: [C:03+1] "Thanks taavi, indeed this makes sense. It would be good to have something to check open ports on our hosts but blindly scanning the space" [puppet] - 10https://gerrit.wikimedia.org/r/1307065 (https://phabricator.wikimedia.org/T265329) (owner: 10Majavah) [09:35:26] (03CR) 10Hashar: "My bad I mixed it up, the domain removed is `wikimediacloud.org`. 
I have replied on my misleading comment https://gerrit.wikimedia.org/r/c" [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [09:36:04] !log Starting s5 codfw failover from db2213 to db2192 - T430923 [09:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:08] T430923: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T430923 [09:36:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2192 to s5 primary T430923', diff saved to https://phabricator.wikimedia.org/P94687 and previous config saved to /var/cache/conftool/dbconfig/20260702-093650-fceratto.json [09:39:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db2213 T430923', diff saved to https://phabricator.wikimedia.org/P94688 and previous config saved to /var/cache/conftool/dbconfig/20260702-093859-fceratto.json [09:39:09] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2213: Repooling after switchover [09:41:26] (03PS1) 10Btullis: Revert "datahub-next: allow egress to the in-cluster OpenSearch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307085 [09:43:38] (03PS1) 10Hnowlan: smart: emit BBU data from megaraid hosts as gauge [puppet] - 10https://gerrit.wikimedia.org/r/1307087 (https://phabricator.wikimedia.org/T430149) [09:44:41] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db2213: Repooling after switchover [09:47:20] (03CR) 10Btullis: [C:03+2] Revert "datahub-next: allow egress to the in-cluster OpenSearch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307085 (owner: 10Btullis) [09:49:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079790 (10ayounsi) [09:49:26] (03Merged) 10jenkins-bot: Revert "datahub-next: allow egress to the in-cluster OpenSearch" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1307085 (owner: 10Btullis) [09:50:32] (03PS4) 10Hashar: zuul: remove wikimediacloud.org from no_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) [09:51:52] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2213: Repooling after switchover [09:52:59] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2213: Repooling after switchover [09:53:08] (03PS1) 10Marostegui: mariadb: Decommission es1033 [puppet] - 10https://gerrit.wikimedia.org/r/1307089 (https://phabricator.wikimedia.org/T408772) [09:53:42] (03CR) 10Hashar: "I have amended the commit message with a lot more explanation." [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [09:54:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [09:55:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Maintenance [09:55:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T426633)', diff saved to https://phabricator.wikimedia.org/P94690 and previous config saved to /var/cache/conftool/dbconfig/20260702-095529-fceratto.json [09:56:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079820 (10ayounsi) [09:59:15] (03CR) 10Jelto: "ack thanks for the clarification! I left one comment in the commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1000) [10:00:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12079859 (10ayounsi) [10:01:19] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1307079 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:01:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T426633)', diff saved to https://phabricator.wikimedia.org/P94691 and previous config saved to /var/cache/conftool/dbconfig/20260702-100122-fceratto.json [10:02:04] (03PS17) 10FNegri: sre.mysql.multiinstance_reboot: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [10:02:04] (03PS1) 10FNegri: tox.ini: allow running a single test [cookbooks] - 10https://gerrit.wikimedia.org/r/1307091 [10:02:50] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [10:03:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.decommission [10:03:42] (03CR) 10Majavah: "We need a stable ID, but `openstack::nova::compute::service` will generate one for hosts where the Hiera key is not set. 
The hiera key was" [puppet] - 10https://gerrit.wikimedia.org/r/1306289 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [10:03:56] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1033.eqiad.wmnet [10:04:26] (03PS1) 10Hnowlan: icinga: clean up disabled librenms components [puppet] - 10https://gerrit.wikimedia.org/r/1307092 (https://phabricator.wikimedia.org/T281095) [10:04:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [10:05:19] (03CR) 10Majavah: [V:03+1 C:03+2] P:diffscan: Do not scan v6 networks [puppet] - 10https://gerrit.wikimedia.org/r/1307065 (https://phabricator.wikimedia.org/T265329) (owner: 10Majavah) [10:05:57] (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2026-07-02-100212-production [puppet] - 10https://gerrit.wikimedia.org/r/1307093 [10:06:32] (03PS3) 10Clément Goubert: service: ipoid to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1306899 (https://phabricator.wikimedia.org/T416623) [10:06:32] (03PS5) 10Clément Goubert: deployment_server: absent ipoid kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/1306779 (https://phabricator.wikimedia.org/T416623) (owner: 10Dreamy Jazz) [10:06:32] (03PS6) 10Clément Goubert: deployment_server: remove ipoid users [puppet] - 10https://gerrit.wikimedia.org/r/1306782 (https://phabricator.wikimedia.org/T416623) (owner: 10Dreamy Jazz) [10:06:32] (03PS3) 10Clément Goubert: service: Remove ipoid service [puppet] - 10https://gerrit.wikimedia.org/r/1306900 (https://phabricator.wikimedia.org/T416623) [10:07:00] (03PS1) 10Clément Goubert: service: ipoid to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1307094 (https://phabricator.wikimedia.org/T416623) [10:07:04] (03CR) 10Ayounsi: [C:03+1] diffscan: Inline template [puppet] - 10https://gerrit.wikimedia.org/r/1307066 (owner: 10Majavah) [10:07:41] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 
2026-07-02-100212-production [puppet] - 10https://gerrit.wikimedia.org/r/1307093 (owner: 10Majavah) [10:07:49] (03CR) 10Majavah: [V:03+1 C:03+2] diffscan: Inline template [puppet] - 10https://gerrit.wikimedia.org/r/1307066 (owner: 10Majavah) [10:07:50] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [10:09:35] (03CR) 10Elukey: [C:03+2] profile::spicerack: fix management_config_data handling [puppet] - 10https://gerrit.wikimedia.org/r/1307079 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:10:12] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [10:10:40] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [10:11:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P94693 and previous config saved to /var/cache/conftool/dbconfig/20260702-101130-fceratto.json [10:12:23] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2214: Repooling [10:12:54] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2214.codfw.wmnet [10:12:54] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2214.codfw.wmnet [10:13:27] !log fnegri@cumin1003 START - Cookbook sre.mysql.multiinstance_reboot for clouddb1017.eqiad.wmnet [10:14:25] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1033.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [10:14:51] (03PS1) 10Majavah: hieradata: Update striker-tools to 2026-07-02-100212-production [puppet] - 10https://gerrit.wikimedia.org/r/1307097 [10:14:57] !log fceratto@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1033.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [10:14:57] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:14:59] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1033.eqiad.wmnet [10:16:06] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission es1033 [puppet] - 10https://gerrit.wikimedia.org/r/1307089 (https://phabricator.wikimedia.org/T408772) (owner: 10Marostegui) [10:16:17] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-tools to 2026-07-02-100212-production [puppet] - 10https://gerrit.wikimedia.org/r/1307097 (owner: 10Majavah) [10:17:59] fceratto@cumin1003 decommission (PID 914846) is awaiting input [10:18:06] !log fceratto@cumin1003 Removing es1033 from zarcillo T408772 [10:18:09] T408772: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772 [10:18:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.decommission (exit_code=0) [10:19:40] (03CR) 10Jelto: [C:03+1] "lgtm, let me know when this should be deployed. 
cc @dzahn@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [10:19:53] !log fnegri@cumin1003 END (PASS) - Cookbook sre.mysql.multiinstance_reboot (exit_code=0) for clouddb1017.eqiad.wmnet [10:20:10] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772#12079948 (10Marostegui) This host is ready for DC-Ops to decommission [10:20:18] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772#12079951 (10Marostegui) a:05Marostegui→03None [10:20:35] 10SRE-tools, 06Infrastructure-Foundations: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#12079954 (10ayounsi) [10:20:47] (03CR) 10Marostegui: "The test went fine for https://phabricator.wikimedia.org/T408772" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [10:20:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [10:21:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P94696 and previous config saved to /var/cache/conftool/dbconfig/20260702-102137-fceratto.json [10:22:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:49] (03PS2) 10Elukey: profile::spicerack: remove unnecessary filter for empty values [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) [10:24:49] (03PS1) 10Elukey: profile::spicerack: fix (again) configuration_data_management [puppet] - 
10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) [10:26:08] (03PS8) 10Btullis: admin_ng: define the topolvm CSI releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306223 (https://phabricator.wikimedia.org/T429331) [10:26:08] (03PS9) 10Btullis: admin_ng: enable the topolvm CSI driver on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305978 (https://phabricator.wikimedia.org/T429331) [10:26:16] (03PS1) 10Atsuko: translate: add lab endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307099 (https://phabricator.wikimedia.org/T430882) [10:26:46] (03PS1) 10Dreamy Jazz: tables-catalog: Mark change_tag_def and change_tag partially public [puppet] - 10https://gerrit.wikimedia.org/r/1307100 (https://phabricator.wikimedia.org/T386456) [10:26:53] (03PS2) 10Elukey: profile::spicerack: fix (again) configuration_data_management [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) [10:26:53] (03PS3) 10Elukey: profile::spicerack: remove unnecessary filter for empty values [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) [10:27:08] (03CR) 10Cathal Mooney: [C:03+2] LVS: add public vlan IPs/subnets for LVS still connected to L2 vlans [puppet] - 10https://gerrit.wikimedia.org/r/1306690 (https://phabricator.wikimedia.org/T430651) (owner: 10Cathal Mooney) [10:27:54] (03PS3) 10Elukey: profile::spicerack: fix (again) configuration_data_management [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) [10:27:54] (03PS4) 10Elukey: profile::spicerack: remove unnecessary filter for empty values [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) [10:28:05] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:31:46] !log fceratto@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db2213 (T426633)', diff saved to https://phabricator.wikimedia.org/P94698 and previous config saved to /var/cache/conftool/dbconfig/20260702-103146-fceratto.json [10:33:26] (03PS1) 10Muehlenhoff: Depool puppetserver1003/2004 for reboots [dns] - 10https://gerrit.wikimedia.org/r/1307101 [10:35:17] (03CR) 10Hashar: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [10:35:19] (03PS4) 10Elukey: profile::spicerack: fix (again) configuration_data_management [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) [10:35:19] (03PS5) 10Elukey: profile::spicerack: remove unnecessary filter for empty values [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) [10:35:59] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:37:34] (03PS1) 10Hnowlan: turnilo: migrate to use blackbox check to replace nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/1307102 (https://phabricator.wikimedia.org/T407117) [10:37:37] (03PS1) 10Hnowlan: turnilo: remove all monitoring cruft [puppet] - 10https://gerrit.wikimedia.org/r/1307103 (https://phabricator.wikimedia.org/T407117) [10:39:10] (03CR) 10Hnowlan: [C:03+2] restabase: remove instance space icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1305084 (https://phabricator.wikimedia.org/T407141) (owner: 10Tiziano Fogli) [10:41:02] (03PS5) 10Tiziano Fogli: restabase: remove instance space icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1305084 (https://phabricator.wikimedia.org/T407141) [10:41:29] (03Abandoned) 10Hnowlan: restabase: remove instance space icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1305084 (https://phabricator.wikimedia.org/T407141) (owner: 10Tiziano Fogli) [10:42:06] (03PS5) 10Elukey: 
profile::spicerack: fix (again) configuration_data_management [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) [10:42:06] (03PS6) 10Elukey: profile::spicerack: remove unnecessary filter for empty values [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) [10:42:28] (03CR) 10Jelto: [C:03+2] zuul: remove wikimediacloud.org from no_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [10:42:31] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:43:25] (03PS2) 10Hnowlan: restbase: move nrpe check to prom blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1306661 (https://phabricator.wikimedia.org/T407141) [10:46:57] (03CR) 10Muehlenhoff: [C:03+2] Depool puppetserver1003/2004 for reboots [dns] - 10https://gerrit.wikimedia.org/r/1307101 (owner: 10Muehlenhoff) [10:47:03] !log jmm@dns1004 START - running authdns-update [10:48:01] (03CR) 10Btullis: topolvm: import the upstream chart version 15.7.1 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305973 (https://phabricator.wikimedia.org/T429331) (owner: 10Btullis) [10:49:10] !log jmm@dns1004 END - running authdns-update [10:49:36] (03PS1) 10Hnowlan: cassandra: remove obsolete expiry check [puppet] - 10https://gerrit.wikimedia.org/r/1307104 (https://phabricator.wikimedia.org/T407117) [10:49:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307099 (https://phabricator.wikimedia.org/T430882) (owner: 10Atsuko) [10:49:56] (03CR) 10Hashar: [C:03+1] "Changes to the beta cluster can be landed anytime given they don't afaik production. 
scap / spiderpig should be smart enough to detect the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307099 (https://phabricator.wikimedia.org/T430882) (owner: 10Atsuko) [10:50:21] (03CR) 10Elukey: [C:03+1] docker-registry: Allow image builds to be pushed from build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1306936 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [10:50:31] jouncebot: nowandnext [10:50:31] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1000) [10:50:31] In 1 hour(s) and 9 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1200) [10:51:22] atsukoito it seems it is quiet now with no ongoing deploy [10:51:54] I +1ed your change for the beta cluster ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1307099 ) , it only touches wmf-config/LabsService.php so scap/spiderpig should not sync production [10:52:08] and a hotfix to unbreak beta qualifies for immediate deployment imho :] [10:54:09] atsukoito: do you want to self deploy it or should I drive it through spiderpig? 
[10:54:34] i'm logging it to the spiderpig, should be able to do it myself [10:55:03] (03PS6) 10Elukey: profile::spicerack: fix (again) configuration_data_management [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) [10:55:03] (03PS7) 10Elukey: profile::spicerack: remove unnecessary filter for empty values [puppet] - 10https://gerrit.wikimedia.org/r/1307080 (https://phabricator.wikimedia.org/T429699) [10:55:14] <3 [10:55:47] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:55:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307099 (https://phabricator.wikimedia.org/T430882) (owner: 10Atsuko) [10:56:50] (03Merged) 10jenkins-bot: translate: add lab endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307099 (https://phabricator.wikimedia.org/T430882) (owner: 10Atsuko) [10:57:51] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2214: Repooling [10:58:35] (03PS1) 10Gkyziridis: ml-services: Qwen36-27b test CUDA graphs on 1013/1014 with raised timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307105 [10:58:44] (03CR) 10Elukey: [C:03+2] profile::spicerack: fix (again) configuration_data_management [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [10:58:50] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:00:48] is gerrit read-only? 
[11:01:41] atsukoito: A DB server went down, being failed over [11:01:42] phab is giving me `#1290: The MariaDB server is running with the --read-only option so it cannot execute this statement` [11:01:45] (ah) [11:02:38] (03PS1) 10Marostegui: Revert "mariadb: Promote db1228 to m3 master" [puppet] - 10https://gerrit.wikimedia.org/r/1307106 [11:02:45] TheresNoTime: on it [11:02:58] (gl!) [11:03:01] (03CR) 10CI reject: [V:04-1] Revert "mariadb: Promote db1228 to m3 master" [puppet] - 10https://gerrit.wikimedia.org/r/1307106 (owner: 10Marostegui) [11:03:50] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:05:50] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:06:19] (03CR) 10Elukey: [C:03+2] "Post-merge comment: I self-merged since the code was the same of the original code review, before applying Jesse's type comments about Sen" [puppet] - 10https://gerrit.wikimedia.org/r/1307098 (https://phabricator.wikimedia.org/T429699) (owner: 10Elukey) [11:06:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:13:03] !incidents [11:13:03] 8123 (ACKED) [3x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad) [11:13:03] 8134 (ACKED) Host 10.3.0.1 [11:13:04] 8135 (ACKED) Host cloudelastic.wikimedia.org [11:13:04] 8146 (UNACKED) db1228 (paged)/MariaDB read only m3 (paged) [11:13:04] 8145 (RESOLVED) Host db1228 (paged) [11:13:04] 8144 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) 
[11:13:05] 8143 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [11:13:05] 8127 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [11:13:05] 8126 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule@main) [11:13:06] 8124 (RESOLVED) [50x] ProbeDown sre () [11:13:06] 8141 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [11:13:11] !ack [11:13:12] 8146 (ACKED) db1228 (paged)/MariaDB read only m3 (paged) [11:13:19] !resolve 8146 [11:13:19] 8146 (RESOLVED) db1228 (paged)/MariaDB read only m3 (paged) [11:13:20] (03PS1) 10Marostegui: mariadb: Make db1250 m3 master [puppet] - 10https://gerrit.wikimedia.org/r/1307108 [11:13:43] (03CR) 10Marostegui: [V:03+2 C:03+2] mariadb: Make db1250 m3 master [puppet] - 10https://gerrit.wikimedia.org/r/1307108 (owner: 10Marostegui) [11:15:48] (03CR) 10Muehlenhoff: [C:03+2] docker-registry: Allow image builds to be pushed from build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1306936 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [11:15:50] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:16:37] (03PS1) 10Marostegui: mariadb: Make db1250 m3 master in dbproxy* [puppet] - 10https://gerrit.wikimedia.org/r/1307109 [11:17:46] (03CR) 10Marostegui: [C:03+2] mariadb: Make db1250 m3 master in dbproxy* [puppet] - 10https://gerrit.wikimedia.org/r/1307109 (owner: 10Marostegui) [11:20:22] (03PS1) 10Marostegui: db1228: No longer critical [puppet] - 10https://gerrit.wikimedia.org/r/1307110 [11:21:51] 10ops-eqiad, 06DBA, 06DC-Ops: db1228 crashed - https://phabricator.wikimedia.org/T430934 (10Marostegui) 03NEW [11:21:56] (03PS1) 10Cathal Mooney: Clouddumps fw rules for lvs healthcheck - remove production networks [puppet] - 10https://gerrit.wikimedia.org/r/1307111 (https://phabricator.wikimedia.org/T430651) 
[11:24:34] 10ops-eqiad, 06DBA, 06DC-Ops: db1228 crashed - https://phabricator.wikimedia.org/T430934#12080080 (10Marostegui) p:05Triage→03High This was m3 master and I had to switch it back to db1250 (which was switched a few hours earlier today). [11:25:04] !incidents [11:25:05] 8123 (ACKED) [3x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad) [11:25:05] 8134 (ACKED) Host 10.3.0.1 [11:25:05] 8135 (ACKED) Host cloudelastic.wikimedia.org [11:25:05] 8146 (RESOLVED) db1228 (paged)/MariaDB read only m3 (paged) [11:25:05] 8145 (RESOLVED) Host db1228 (paged) [11:25:06] 8144 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) [11:25:06] 8143 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [11:25:07] 8127 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [11:25:07] 8126 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule@main) [11:25:08] 8124 (RESOLVED) [50x] ProbeDown sre () [11:25:08] 8141 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [11:25:46] (03CR) 10Marostegui: [C:03+2] db1228: No longer critical [puppet] - 10https://gerrit.wikimedia.org/r/1307110 (owner: 10Marostegui) [11:26:12] (03Abandoned) 10Marostegui: Revert "mariadb: Promote db1228 to m3 master" [puppet] - 10https://gerrit.wikimedia.org/r/1307106 (owner: 10Marostegui) [11:29:19] !log jmm@cumin2003 START - Cookbook sre.puppet.disable-merges [11:29:20] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.disable-merges (exit_code=0) [11:30:44] !log jmm@cumin2003 START - Cookbook sre.hosts.reboot-single for host puppetserver2004.codfw.wmnet [11:30:50] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:31:01] (03CR) 10Kamila Součková: [C:03+2] mediawiki/php: fix apt component for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1306979 
(https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [11:31:03] (03CR) 10Blake: [C:03+2] services: Add a new mw-pretrain k8s service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306878 (owner: 10Blake) [11:33:13] (03Merged) 10jenkins-bot: services: Add a new mw-pretrain k8s service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306878 (owner: 10Blake) [11:34:57] (03CR) 10Krinkle: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1306230 (https://phabricator.wikimedia.org/T427623) (owner: 10Krinkle) [11:35:50] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:36:44] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2004.codfw.wmnet [11:36:50] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:37:26] !log jmm@cumin2003 START - Cookbook sre.hosts.reboot-single for host puppetserver1003.eqiad.wmnet [11:39:28] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts wdqs-categories1001.eqiad.wmnet [11:41:15] (03PS1) 10Btullis: datahub: rename config to datahub_config in subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307113 (https://phabricator.wikimedia.org/T402408) [11:41:18] (03PS1) 10Btullis: datahub-next: rename config to datahub_config in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307114 (https://phabricator.wikimedia.org/T402408) [11:41:20] (03PS1) 10Btullis: datahub: align upgrade-job ES auth with the subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307115 (https://phabricator.wikimedia.org/T402408) [11:41:23] (03PS1) 10Btullis: datahub-next: authenticate 
to OpenSearch and verify its cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307116 (https://phabricator.wikimedia.org/T402408) [11:41:43] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1003.eqiad.wmnet [11:42:08] !log jmm@cumin2003 START - Cookbook sre.puppet.disable-merges [11:42:09] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.disable-merges (exit_code=0) [11:44:18] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [11:46:50] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:47:20] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:50:10] btullis@cumin1003 decommission (PID 982765) is awaiting input [11:51:53] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs-categories1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [11:54:28] (03PS1) 10Majavah: hieradata: Indicate cloudweb2002-dev does not need depooling [puppet] - 10https://gerrit.wikimedia.org/r/1307118 (https://phabricator.wikimedia.org/T430918) [11:54:58] btullis@cumin1003 decommission (PID 982765) is awaiting input [11:55:58] 10SRE-Access-Requests: Superset data access request - https://phabricator.wikimedia.org/T430938 (10ABendall-WMF) 03NEW [11:56:39] (03PS1) 10Muehlenhoff: Revert "Depool puppetserver1003/2004 for reboots" [dns] - 10https://gerrit.wikimedia.org/r/1307119 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1200) [12:01:23] (03PS1) 10Majavah: hieradata: Use 
wildcard certificates for openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/1307120 (https://phabricator.wikimedia.org/T377055) [12:02:20] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [12:03:20] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [12:07:51] (03CR) 10Muehlenhoff: [C:03+2] Revert "Depool puppetserver1003/2004 for reboots" [dns] - 10https://gerrit.wikimedia.org/r/1307119 (owner: 10Muehlenhoff) [12:07:55] !log jmm@dns1004 START - running authdns-update [12:10:03] !log jmm@dns1004 END - running authdns-update [12:19:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs-categories1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [12:19:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs-categories1001.eqiad.wmnet [12:20:23] (03CR) 10CDobbins: [C:03+2] varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [12:21:35] (03PS1) 10Kamila Součková: pontoon: new stack raine-bookwormupgrade [puppet] - 10https://gerrit.wikimedia.org/r/1307125 [12:21:35] (03PS1) 10Kamila Součková: pontoon: add rolegroup bootstrap to raine-bookwormupgrade [puppet] - 10https://gerrit.wikimedia.org/r/1307126 [12:21:35] (03PS1) 10Kamila Součková: pontoon: add deploy host [puppet] - 10https://gerrit.wikimedia.org/r/1307127 [12:21:35] (03PS1) 10Kamila 
Součková: mediawiki/php: don't pin libpcre for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1307128 (https://phabricator.wikimedia.org/T423714) [12:22:09] ffs gitbutler keeps going from being awesome to being terrible [12:22:29] (03Abandoned) 10Kamila Součková: pontoon: new stack raine-bookwormupgrade [puppet] - 10https://gerrit.wikimedia.org/r/1307125 (owner: 10Kamila Součková) [12:22:38] (03Abandoned) 10Kamila Součková: pontoon: add rolegroup bootstrap to raine-bookwormupgrade [puppet] - 10https://gerrit.wikimedia.org/r/1307126 (owner: 10Kamila Součková) [12:22:46] (03Abandoned) 10Kamila Součková: pontoon: add deploy host [puppet] - 10https://gerrit.wikimedia.org/r/1307127 (owner: 10Kamila Součková) [12:23:40] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307128 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:27:47] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1307129 (owner: 10L10n-bot) [12:32:30] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:44:49] (03CR) 10Elukey: [C:03+2] profile::cache::haproxy: change webrequest top 10k IPs map name [puppet] - 10https://gerrit.wikimedia.org/r/1306545 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [12:48:20] (03CR) 10Tiziano Fogli: restbase: move nrpe check to prom blackbox check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1306661 (https://phabricator.wikimedia.org/T407141) (owner: 10Hnowlan) [12:48:32] Raine: TIL [12:52:25] RESOLVED: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:30] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:53:01] (03PS2) 10Dreamy Jazz: maintain-views: Make change_tag_def and change_tag partially public [puppet] - 10https://gerrit.wikimedia.org/r/1307100 (https://phabricator.wikimedia.org/T386456) [12:53:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:22] (03PS1) 10Filippo Giunchedi: Revert "dumps: temp allow production_networks for nfs healthchecks" [puppet] - 10https://gerrit.wikimedia.org/r/1307141 [12:54:30] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] Revert "dumps: temp allow production_networks for nfs healthchecks" [puppet] - 10https://gerrit.wikimedia.org/r/1307141 (owner: 10Filippo Giunchedi) [12:55:00] (03PS1) 10Elukey: profile::cache::haproxy: remove .txt suffix from the webrequest sources [puppet] - 10https://gerrit.wikimedia.org/r/1307142 (https://phabricator.wikimedia.org/T402512) [12:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:55:41] (03CR) 10Fabfur: [C:03+1] profile::cache::haproxy: remove .txt suffix from the webrequest sources [puppet] - 10https://gerrit.wikimedia.org/r/1307142 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [12:56:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:56:27] (03CR) 10Filippo Giunchedi: [C:03+2] "Chatted on IRC, followup/cleanup is https://phabricator.wikimedia.org/T430933" [puppet] - 10https://gerrit.wikimedia.org/r/1306289 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [12:56:57] (03CR) 10Filippo Giunchedi: "Thank you, I didn't see this patch and reverted already with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1307141" [puppet] - 10https://gerrit.wikimedia.org/r/1307111 (https://phabricator.wikimedia.org/T430651) (owner: 10Cathal Mooney) [12:57:24] (03CR) 10Tiziano Fogli: [C:03+1] "LGTM, thx!" [puppet] - 10https://gerrit.wikimedia.org/r/1305718 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite) [12:57:47] (03CR) 10Elukey: [C:03+2] profile::cache::haproxy: remove .txt suffix from the webrequest sources [puppet] - 10https://gerrit.wikimedia.org/r/1307142 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [12:57:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:58:07] (03CR) 10Filippo Giunchedi: [C:03+1] hieradata: Indicate cloudweb2002-dev does not need depooling [puppet] - 10https://gerrit.wikimedia.org/r/1307118 (https://phabricator.wikimedia.org/T430918) (owner: 10Majavah) [12:58:32] (03PS2) 10Btullis: datahub: rename config to datahub_config in sub-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307113 (https://phabricator.wikimedia.org/T402408) [12:58:33] (03PS2) 10Btullis: datahub-next: rename config to datahub_config in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307114 
(https://phabricator.wikimedia.org/T402408) [12:58:33] (03PS2) 10Btullis: datahub: align upgrade-job ES auth with the sub-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307115 (https://phabricator.wikimedia.org/T402408) [12:58:33] (03PS2) 10Btullis: datahub-next: authenticate to OpenSearch and verify its cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307116 (https://phabricator.wikimedia.org/T402408) [12:59:47] (03CR) 10Filippo Giunchedi: [C:03+1] hieradata: Use wildcard certificates for openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/1307120 (https://phabricator.wikimedia.org/T377055) (owner: 10Majavah) [13:00:04] Lucas_WMDE, urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1300). [13:00:05] aude: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] deployment window is only my patch, so I can deploy it [13:00:17] I can’t deploy, in a meeting [13:00:17] o7 [13:00:22] aude: ok \o/ [13:00:52] (03PS9) 10Krinkle: varnish: Add edge fixup for corrupt upload.wm.o urls from mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/1306230 (https://phabricator.wikimedia.org/T427623) [13:00:59] (03CR) 10Tiziano Fogli: [C:03+1] redis: clean up redis nrpe check components [puppet] - 10https://gerrit.wikimedia.org/r/1305077 (https://phabricator.wikimedia.org/T384924) (owner: 10Hnowlan) [13:01:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305773 (https://phabricator.wikimedia.org/T430227) (owner: 10Jdrewniak) [13:02:08] (03Merged) 10jenkins-bot: Phase 3 Legal contact link deployments. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305773 (https://phabricator.wikimedia.org/T430227) (owner: 10Jdrewniak) [13:02:24] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1305773|Phase 3 Legal contact link deployments. (T430227)]] [13:02:27] T430227: [Footer link] Phase 3 deployments - https://phabricator.wikimedia.org/T430227 [13:03:25] (03PS4) 10Ladsgroup: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) [13:03:31] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) (owner: 10Ladsgroup) [13:04:34] !log aude@deploy1003 jdrewniak, aude: Backport for [[gerrit:1305773|Phase 3 Legal contact link deployments. (T430227)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:05:12] (03CR) 10Tiziano Fogli: [C:03+2] mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [13:05:23] !log aude@deploy1003 jdrewniak, aude: Continuing with deployment [13:06:09] (03PS3) 10JHathaway: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1305988 (https://phabricator.wikimedia.org/T372666) [13:06:25] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386#12080641 (10cmooney) 05Resolved→03Open Probably a little premature to close. I've opened ticket 2026-0616-761841 to confirm exactly what Jun... 
[13:06:25] RESOLVED: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:42] (03CR) 10CI reject: [V:04-1] Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1305988 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [13:07:17] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961 (10catherine.kelsey.wmde) 03NEW [13:08:01] (03CR) 10Filippo Giunchedi: [C:03+2] team-wmcs: introduce per-namespace neutron conntrack alert [alerts] - 10https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [13:09:44] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305773|Phase 3 Legal contact link deployments. (T430227)]] (duration: 07m 20s) [13:09:48] T430227: [Footer link] Phase 3 deployments - https://phabricator.wikimedia.org/T430227 [13:11:03] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [13:11:04] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=97) rolling restart_daemons on A:wikidough [13:11:12] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [13:11:40] elukey: my opinions on gitbutler have a very bimodal distribution, so I'm hesitant to recommend it :D [13:11:51] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and (A:dnsbox) [13:11:53] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.dns.roll-restart (exit_code=97) rolling restart_daemons on A:dnsbox and (A:dnsbox) [13:11:55] but I do appreciate some of it enough that I haven't uninstalled it yet [13:12:23] !log 
sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and (A:dnsbox) [13:12:43] (03CR) 10Btullis: [C:03+2] datahub: rename config to datahub_config in sub-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307113 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:13:19] (03CR) 10Btullis: [C:03+2] datahub: align upgrade-job ES auth with the sub-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307115 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:13:21] (03CR) 10Gergő Tisza: [C:03+1] varnish: Add edge fixup for corrupt upload.wm.o urls from mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/1306230 (https://phabricator.wikimedia.org/T427623) (owner: 10Krinkle) [13:13:32] (03CR) 10Btullis: [C:03+2] datahub-next: authenticate to OpenSearch and verify its cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307116 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:14:04] (03PS3) 10Dreamy Jazz: maintain-views: Make change_tag_def and change_tag partially public [puppet] - 10https://gerrit.wikimedia.org/r/1307100 (https://phabricator.wikimedia.org/T386456) [13:14:04] (03CR) 10Dreamy Jazz: "EXPLAIN outputs for these custom views using `enwiki` as the wiki to test this on:" [puppet] - 10https://gerrit.wikimedia.org/r/1307100 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [13:15:00] (03Merged) 10jenkins-bot: datahub: rename config to datahub_config in sub-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307113 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:17:12] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns1004.wikimedia.org [13:17:50] !log bking@cumin2003 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw [13:18:25] RESOLVED: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:05] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:22:48] !log bking@cumin2003 conftool action : set/pooled=true; selector: dnsdisc=search-omega,name=codfw [13:23:18] !log bking@cumin2003 conftool action : set/pooled=true; selector: dnsdisc=search-psi,name=codfw [13:23:20] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:25:26] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [13:26:17] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:27:16] (03CR) 10Btullis: [C:03+2] datahub-next: rename config to datahub_config in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307114 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:27:36] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:29:22] (03Merged) 10jenkins-bot: datahub-next: rename config to datahub_config in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307114 
(https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:29:32] (03Merged) 10jenkins-bot: datahub: align upgrade-job ES auth with the sub-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307115 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:29:36] (03Merged) 10jenkins-bot: datahub-next: authenticate to OpenSearch and verify its cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307116 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [13:29:45] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:29:57] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:30:58] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:32:03] !incidents [13:32:04] 8123 (ACKED, 24h 01m old) [3x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad) [13:32:04] 8134 (ACKED) Host 10.3.0.1 [13:32:04] 8135 (ACKED) Host cloudelastic.wikimedia.org [13:32:04] 8146 (RESOLVED) db1228 (paged)/MariaDB read only m3 (paged) [13:32:04] 8145 (RESOLVED) Host db1228 (paged) [13:32:05] 8144 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) [13:32:05] 8143 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [13:32:06] 8127 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [13:32:06] 8126 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule@main) [13:32:07] 8124 (RESOLVED) [50x] ProbeDown sre () [13:32:07] 8141 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [13:34:42] !log bking@cumin2003 conftool action : set/pooled=false; selector: dnsdisc=search-psi,name=codfw [13:36:06] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:36:33] !log bking@cumin2003 conftool action : set/pooled=true; selector: dnsdisc=search-psi,name=codfw [13:37:10] !log bking@cumin2003 conftool action : set/pooled=false; selector: dnsdisc=search,name=codfw [13:38:57] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [13:40:28] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1307148 (https://phabricator.wikimedia.org/T430964) [13:40:39] !log bking@cumin2003 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw [13:42:29] (03CR) 10Majavah: [C:03+2] hieradata: Indicate cloudweb2002-dev does not need depooling [puppet] - 10https://gerrit.wikimedia.org/r/1307118 (https://phabricator.wikimedia.org/T430918) (owner: 10Majavah) [13:42:38] (03CR) 10Majavah: [C:03+2] hieradata: Use wildcard certificates for openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/1307120 (https://phabricator.wikimedia.org/T377055) (owner: 10Majavah) [13:44:08] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-pretrain: apply [13:44:21] !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-pretrain: apply [13:44:25] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-pretrain: apply [13:46:18] (03PS1) 10Majavah: hieradata: Mark WMCS backup roles as not needing depools [puppet] - 10https://gerrit.wikimedia.org/r/1307149 [13:47:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T430912 [13:47:12] T430912: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T430912 [13:47:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2209 with weight 0 T430912', diff saved to https://phabricator.wikimedia.org/P94702 and previous config saved to 
/var/cache/conftool/dbconfig/20260702-134719-fceratto.json [13:48:49] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:48:58] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline (but feel free to ignore)." [puppet] - 10https://gerrit.wikimedia.org/r/1307128 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [13:49:47] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [13:50:32] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1307068 (https://phabricator.wikimedia.org/T430912) (owner: 10Gerrit maintenance bot) [13:51:15] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-pretrain: apply [13:51:24] !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-pretrain: apply [13:51:53] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, +Andrew" [puppet] - 10https://gerrit.wikimedia.org/r/1307149 (owner: 10Majavah) [13:52:00] !log Starting s3 codfw failover from db2205 to db2209 - T430912 [13:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2209 to s3 primary T430912', diff saved to https://phabricator.wikimedia.org/P94703 and previous config saved to /var/cache/conftool/dbconfig/20260702-135235-fceratto.json [13:52:40] T430912: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T430912 [13:53:43] o/ I think the window's open if I could deploy some private code? 
[13:53:55] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:54:10] !log installing sed security updates [13:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:15] (03CR) 10Majavah: [C:03+2] hieradata: Mark WMCS backup roles as not needing depools [puppet] - 10https://gerrit.wikimedia.org/r/1307149 (owner: 10Majavah) [13:55:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db2205 T430912', diff saved to https://phabricator.wikimedia.org/P94704 and previous config saved to /var/cache/conftool/dbconfig/20260702-135505-fceratto.json [13:55:14] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2205: Repooling after switchover [13:55:42] (03PS1) 10Dpogorzelski: ml-serve: remove legacy kserve chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307155 (https://phabricator.wikimedia.org/T426823) [13:55:51] (03CR) 10Kosta Harlan: [C:03+1] maintain-views: Make change_tag_def and change_tag partially public [puppet] - 10https://gerrit.wikimedia.org/r/1307100 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [13:57:40] going to start [13:59:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:59:43] (03PS1) 10Hnowlan: archiva: migrate to blackbox HTTP check [puppet] - 10https://gerrit.wikimedia.org/r/1307158 (https://phabricator.wikimedia.org/T407117) [13:59:46] (03PS1) 10Hnowlan: archiva: clean up absented check [puppet] - 10https://gerrit.wikimedia.org/r/1307159 (https://phabricator.wikimedia.org/T407117) [13:59:47] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2205: Repooling after switchover [13:59:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2205: Repooling after switchover [14:04:45] !log elukey@cumin1003 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:04:54] (03CR) 10CI reject: [V:04-1] ml-serve: remove legacy kserve chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307155 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [14:05:22] (03CR) 10Dpogorzelski: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307155 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [14:05:57] (03PS1) 10Btullis: datahub: give the system-update job the token signing key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307161 (https://phabricator.wikimedia.org/T402408) [14:06:43] !log Deployed patch for T427287 [14:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:45] 06SRE, 06Infrastructure-Foundations, 10netops: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12081032 (10cmooney) Nokia came back to advise this is a known bug in 24.10.4. ` Hello Cathal, The gRPC server stops responding on SR-Linux?... 
[14:06:45] done [14:06:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:07:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:07:56] (03PS18) 10FNegri: sre.mysql.multiinstance_reboot: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [14:09:10] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2205: Repooling after switchover [14:09:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: Maintenance [14:09:54] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12081060 (10MoritzMuehlenhoff) [14:10:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2205 (T426633)', diff saved to https://phabricator.wikimedia.org/P94706 and previous config saved to /var/cache/conftool/dbconfig/20260702-140959-fceratto.json [14:11:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and (A:dnsbox) [14:11:32] (03PS2) 10Kamila Součková: mediawiki/php: don't pin libpcre for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1307128 (https://phabricator.wikimedia.org/T423714) [14:12:06] (03CR) 10Kamila Součková: mediawiki/php: don't pin libpcre for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1307128 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [14:12:40] !log installing rsync security updates [14:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1307128 
(https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [14:14:41] (03CR) 10Kamila Součková: [C:03+2] mediawiki/php: don't pin libpcre for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1307128 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [14:16:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T426633)', diff saved to https://phabricator.wikimedia.org/P94707 and previous config saved to /var/cache/conftool/dbconfig/20260702-141621-fceratto.json [14:19:39] (03PS1) 10Hnowlan: cassandra: remove disabled CQL check [puppet] - 10https://gerrit.wikimedia.org/r/1307167 (https://phabricator.wikimedia.org/T407120) [14:26:16] (03CR) 10Krinkle: "It looks like you may want to use static.php here, given these are static files, right? That endpoint is designed for serving static files" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304109 (https://phabricator.wikimedia.org/T429599) (owner: 10Effie Mouzeli) [14:26:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P94708 and previous config saved to /var/cache/conftool/dbconfig/20260702-142628-fceratto.json [14:27:55] (03CR) 10Krinkle: "It also has metrics in place and might be easier to monitor for potential misuse, or surprising load patterns around disk access etc. It's" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304109 (https://phabricator.wikimedia.org/T429599) (owner: 10Effie Mouzeli) [14:28:00] (03PS2) 10Dpogorzelski: ml-serve: remove legacy kserve chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307155 (https://phabricator.wikimedia.org/T426823) [14:29:00] (03CR) 10FNegri: "I cleaned this patch up a bit, and tested it again using "test-cookbook" against a clouddb. 
I would like to get a +1 from data-persistence" [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [14:29:25] (03PS2) 10Krinkle: robots.php: Change Beta Cluster override from prepend to replace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265672 [14:29:51] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12081161 (10MoritzMuehlenhoff) [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1430) [14:31:29] (03PS1) 10Elukey: profile::benthos: add x-provenance handling for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1307169 (https://phabricator.wikimedia.org/T427068) [14:32:00] !log installing libdbi-perl security updates [14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:02] (03CR) 10CI reject: [V:04-1] profile::benthos: add x-provenance handling for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1307169 (https://phabricator.wikimedia.org/T427068) (owner: 10Elukey) [14:33:13] (03PS2) 10Elukey: profile::benthos: add x-provenance handling for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1307169 (https://phabricator.wikimedia.org/T427068) [14:36:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P94709 and previous config saved to /var/cache/conftool/dbconfig/20260702-143636-fceratto.json [14:40:20] (03PS1) 10Muehlenhoff: Remove alerts for the mirror lag [alerts] - 10https://gerrit.wikimedia.org/r/1307176 (https://phabricator.wikimedia.org/T416707) [14:40:57] 06SRE, 10SRE-Access-Requests: Superset data access request - https://phabricator.wikimedia.org/T430938#12081235 (10Aklapper) @ABendall-WMF: Hi and welcome! 
If you find time, please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LD... [14:41:07] 06SRE, 10SRE-Access-Requests: Superset data access request for abibendall - https://phabricator.wikimedia.org/T430938#12081236 (10Aklapper) [14:44:37] !incidents [14:44:37] 8134 (UNACKED, 24h 04m old) Host 10.3.0.1 [14:44:37] 8135 (UNACKED, 24h 03m old) Host cloudelastic.wikimedia.org [14:44:37] 8146 (RESOLVED) db1228 (paged)/MariaDB read only m3 (paged) [14:44:38] 8145 (RESOLVED) Host db1228 (paged) [14:44:38] 8144 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) [14:44:38] 8143 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [14:44:45] !ack [14:44:46] 8134 (ACKED, 24h 04m old) Host 10.3.0.1 [14:44:46] 8135 (ACKED, 24h 04m old) Host cloudelastic.wikimedia.org [14:45:33] Raine: same pattern as before for pybal, I'm also going to resolve it? [14:45:39] yes please [14:45:41] thanks moritzm <3 [14:45:53] done [14:45:58] <3 [14:46:08] there's no acked, yet unresolved incidents any more [14:46:20] excellent, thank you [14:46:27] at least my late lunch is triggering only annoying things, not actually broken things :D [14:46:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T426633)', diff saved to https://phabricator.wikimedia.org/P94711 and previous config saved to /var/cache/conftool/dbconfig/20260702-144644-fceratto.json [14:48:20] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [14:48:56] (03PS1) 10Majavah: hieradata: Fix Keystone IDP callback URLs [puppet] - 10https://gerrit.wikimedia.org/r/1307178 (https://phabricator.wikimedia.org/T430957) [14:49:20] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - 
https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [14:50:11] (03CR) 10Majavah: [C:03+2] "merging per approvals on task" [puppet] - 10https://gerrit.wikimedia.org/r/1307178 (https://phabricator.wikimedia.org/T430957) (owner: 10Majavah) [14:51:33] (03PS1) 10Giuseppe Lavagetto: Unblock taavi's access to hp [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1307179 [14:51:42] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Unblock taavi's access to hp [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1307179 (owner: 10Giuseppe Lavagetto) [14:52:20] 07sre-alert-triage, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T430139#12081306 (10Gehel) [14:52:58] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Unblock taavi - oblivian@cumin1003" [14:53:00] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Unblock taavi - oblivian@cumin1003 [14:53:05] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1021: Security updates [14:53:05] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [14:53:13] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:53:13] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1021: Security updates [14:53:35] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:53:49] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Unblock taavi - oblivian@cumin1003 [14:53:51] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Unblock taavi - oblivian@cumin1003" [14:54:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12081319 (10Gehel) [14:54:09] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:54:55] 07sre-alert-triage, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1208) - https://phabricator.wikimedia.org/T430138#12081339 (10Gehel) [14:56:56] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:56:57] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Qwen36-27b test CUDA graphs on 1013/1014 with raised timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307105 (owner: 10Gkyziridis) [14:57:42] (03CR) 10Atsuko: [C:03+1] ml-serve: remove legacy kserve chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307155 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [14:57:45] (03CR) 10Cathal Mooney: "ha I'll have to be quicker off the draw in future! thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1307111 (https://phabricator.wikimedia.org/T430651) (owner: 10Cathal Mooney) [14:57:50] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[14:57:57] (03Abandoned) 10Cathal Mooney: Clouddumps fw rules for lvs healthcheck - remove production networks [puppet] - 10https://gerrit.wikimedia.org/r/1307111 (https://phabricator.wikimedia.org/T430651) (owner: 10Cathal Mooney) [14:58:05] (03CR) 10Btullis: [C:03+2] datahub: give the system-update job the token signing key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307161 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [14:58:41] PROBLEM - MariaDB Replica IO: pc1 on pc2021 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1021.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1021.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [15:00:05] andre and brennen: How many deployers does it take to do Train log triage deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1500). 
[15:00:30] (03Merged) 10jenkins-bot: datahub: give the system-update job the token signing key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307161 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [15:00:41] RECOVERY - MariaDB Replica IO: pc1 on pc2021 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [15:02:21] !incidents [15:02:21] 8146 (RESOLVED) db1228 (paged)/MariaDB read only m3 (paged) [15:02:21] 8145 (RESOLVED) Host db1228 (paged) [15:02:22] 8144 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) [15:02:22] 8143 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [15:02:58] (03PS1) 10Muehlenhoff: Move mirror1001 to insetup role for eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/1307180 (https://phabricator.wikimedia.org/T416707) [15:06:32] (03CR) 10Muehlenhoff: [C:03+2] Move mirror1001 to insetup role for eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/1307180 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [15:08:33] !log installing Tomcat security updates [15:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:37] !log installing giflib security updates [15:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:52] (03PS1) 10Hnowlan: rpkivalidator: migrate TCP check to blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1307182 (https://phabricator.wikimedia.org/T407117) [15:14:55] (03PS1) 10Hnowlan: rpkivalidator: remove disabled nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/1307183 (https://phabricator.wikimedia.org/T407117) [15:15:13] !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1021: Security updates [15:15:13] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [15:15:26] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [15:15:26] !log 
root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1021: Security updates [15:15:43] (03CR) 10Ssingh: [C:03+1] taskgen: allow profile_yaml to render templates [puppet] - 10https://gerrit.wikimedia.org/r/1306481 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [15:17:04] (03Abandoned) 10Ssingh: Revert "Failover url-downloader.eqiad CNAME to one of the new Trixie hosts" [dns] - 10https://gerrit.wikimedia.org/r/1306741 (owner: 10Ssingh) [15:18:27] (03CR) 10Ssingh: "This is on me Jesse, I have not got a chance to go through this yet but I am fine with a single change." [puppet] - 10https://gerrit.wikimedia.org/r/1305984 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [15:20:34] !log installing busybox updates from trixie point release [15:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:37] (03PS1) 10Aqu: Airflow-main: Add GCP Airflow connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307184 (https://phabricator.wikimedia.org/T427457) [15:24:00] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12081575 (10MoritzMuehlenhoff) [15:24:17] !log installing busybox updates from bookworm point release [15:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:54] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772#12081588 (10VRiley-WMF) a:03VRiley-WMF [15:30:56] (03CR) 10Hnowlan: [C:03+2] redis: clean up redis nrpe check components [puppet] - 10https://gerrit.wikimedia.org/r/1305077 (https://phabricator.wikimedia.org/T384924) (owner: 10Hnowlan) [15:32:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12081645 (10MoritzMuehlenhoff) [15:38:01] (03PS8) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 
10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [15:42:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-master1003.eqiad.wmnet with OS bookworm [15:45:49] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1023: Security updates [15:45:49] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [15:45:57] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [15:45:57] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1023: Security updates [15:47:32] (03CR) 10CWilliams: sre.mysql.multiinstance_reboot: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [15:50:09] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772#12081739 (10VRiley-WMF) 05Open→03Resolved [15:50:11] (03CR) 10Elukey: Add sre.hosts.bmc-user-mgmt.py (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [15:51:45] PROBLEM - MariaDB Replica IO: pc3 on pc2023 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1023.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1023.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [15:53:43] RECOVERY - MariaDB Replica IO: pc3 on pc2023 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [15:54:04] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1003.eqiad.wmnet with reason: host reimage [15:54:46] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-pretrain: apply [15:54:48] 
!log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-pretrain: apply [15:54:54] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-pretrain: apply [15:57:26] (03CR) 10Elukey: [C:03+1] admin_ng: Fix KServe view RBAC on install_kserve_resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306892 (owner: 10Bartosz Wójtowicz) [15:57:46] (03CR) 10BCornwall: [C:03+1] "Went through and they all look good per the docs (they're identical in description between the legacy and current facts). The scariest one" [puppet] - 10https://gerrit.wikimedia.org/r/1305984 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [15:58:28] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1003.eqiad.wmnet with reason: host reimage [16:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260702T1600). [16:00:05] Dreamy_Jazz and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] o/ [16:00:16] \o [16:00:36] o/ be right with you [16:00:44] I believe that the undeployment of iPoid will take a long time [16:00:47] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: remove legacy kserve chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307155 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [16:01:02] 'twill yeah [16:01:03] or, 'tmight [16:01:17] Maybe dancy's change goes first then? [16:01:18] let me look at dancy's patch first and probably get that out of the way, if you don't mind Dreamy_Jazz? [16:01:20] rad [16:01:22] Mine's a quickie [16:01:42] Already deployed to deployment-prep. 
[16:02:08] (03CR) 10Kamila Součková: [C:03+1] k8s: add new stacked control planes wikikube-ctrl100[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/1295483 (https://phabricator.wikimedia.org/T418920) (owner: 10Jasmine) [16:02:37] (03CR) 10Dzahn: "thanks for already deploying this" [puppet] - 10https://gerrit.wikimedia.org/r/1306952 (https://phabricator.wikimedia.org/T430479) (owner: 10Hashar) [16:02:58] rzl: thanks for handling the patches [16:03:59] jhathaway: actually if you can eyeball dancy's I'd appreciate it -- I think it's only relevant to deployment-prep but any time we start looking at puppetserver certs I get paranoid :) [16:04:11] looking... [16:05:55] that patch is giving me a deja vu feeling [16:06:03] didn't we already have some other mechanism in puppet to do that? [16:06:30] To add the missing newline? Or to pin the puppetmaster CA cert? [16:06:38] to pin the cert [16:07:13] I didn't find one at the time, but if one already exists, I'm happy to use it. [16:08:37] !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1023: Security updates [16:08:37] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [16:08:51] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [16:08:51] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1023: Security updates [16:09:05] (03CR) 10FNegri: sre.mysql.multiinstance_reboot: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [16:09:10] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml-serve: remove legacy kserve chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307155 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [16:09:33] (03CR) 10RLazarus: [C:03+2] services_proxy: Remove ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1306903 (https://phabricator.wikimedia.org/T416623) (owner: 10Clément Goubert) [16:09:43] FIRING: [2x] 
JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:59] (03CR) 10CWilliams: sre.mysql.multiinstance_reboot: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [16:11:40] !log cwilliams@cumin1003 START - Cookbook sre.mysql.multiinstance_reboot for db-test[2001-2002].codfw.wmnet [16:11:55] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.multiinstance_reboot (exit_code=99) for db-test[2001-2002].codfw.wmnet [16:13:17] dancy: what is the value of $facts['puppet_config']['localcacert'], /var/lib/puppet/ssl/certs/ca.pem? [16:13:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-master1003.eqiad.wmnet with OS bookworm [16:14:38] jhathaway: That's right. [16:14:43] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST events) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:15:24] dancy: and why is the newline not present, it seems to be there when I spot checked sretest1005 [16:16:13] sretest1005 isn't using profile::puppet::agent::puppetserver_ca_cert, so this bit of code is irrelevant there. 
[16:16:21] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-master1004.eqiad.wmnet with OS bookworm [16:16:37] The only thing using profile::puppet::agent::puppetserver_ca_cert right now is the deployment-prep horizon project. [16:18:56] (puppet's running on the deployment host now for the ipoid listener removal, then I'll scap that out) [16:19:41] (03CR) 10JHathaway: profile::puppet::agent: write pinned CA cert with a trailing newline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1306745 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [16:19:58] dancy: I made a small suggestion, to avoid dup newlines [16:20:35] (03CR) 10Federico Ceratto: sre.mysql.multiinstance_reboot: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [16:21:23] (03PS3) 10Ahmon Dancy: profile::puppet::agent: write pinned CA cert with a trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/1306745 (https://phabricator.wikimedia.org/T429413) [16:21:28] (03CR) 10Ahmon Dancy: profile::puppet::agent: write pinned CA cert with a trailing newline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1306745 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [16:21:42] jhathaway: Applied [16:22:33] thanks [16:22:59] rzl: patch should be fine to rollout [16:23:06] thank you! [16:23:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: sync [16:23:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: sync [16:23:52] Taavi: I did now find profile::base::certificates::puppet_ca_content. Is that what you were thinking of? 
[16:24:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [16:25:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.multiinstance_reboot for db-test[2001-2002].codfw.wmnet [16:25:34] !log rzl@deploy1003 Started scap sync-world: T416623 [16:25:37] T416623: Decommission NodeJS IPoid service - https://phabricator.wikimedia.org/T416623 [16:25:45] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.multiinstance_reboot (exit_code=99) for db-test[2001-2002].codfw.wmnet [16:26:13] (03CR) 10JHathaway: Add sre.hosts.bmc-user-mgmt.py (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [16:26:33] !log rzl@deploy1003 rzl: T416623 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:27:03] (03PS1) 10David Caro: access: add thibaut to the wmcs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/1307195 (https://phabricator.wikimedia.org/T427914) [16:27:20] Dreamy_Jazz: sanity check at mw-debug please? 
can't imagine anything will break, since we just removed an envoy listener to the service I think you turned off already, but just the same [16:27:34] Sure [16:28:02] (03CR) 10CI reject: [V:04-1] access: add thibaut to the wmcs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/1307195 (https://phabricator.wikimedia.org/T427914) (owner: 10David Caro) [16:28:09] Checking that the consumer of iPoid opensearch still works [16:28:24] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1004.eqiad.wmnet with reason: host reimage [16:30:00] I still see the data from the opensearch instance + no errors in logstash that I can see [16:30:03] (03CR) 10David Caro: "oh, we need to add him to the user list too, with ssh key and such" [puppet] - 10https://gerrit.wikimedia.org/r/1307195 (https://phabricator.wikimedia.org/T427914) (owner: 10David Caro) [16:30:10] So should be fine to proceed [16:30:18] (03CR) 10Dzahn: "the user itself does not exist yet; you will have to add it as well in this file, around line 7278 right before the "ldap_only_users" sect" [puppet] - 10https://gerrit.wikimedia.org/r/1307195 (https://phabricator.wikimedia.org/T427914) (owner: 10David Caro) [16:30:57] thanks [16:30:59] !log rzl@deploy1003 rzl: Continuing with deployment [16:31:50] (03CR) 10Dzahn: "tagging the linked ticket as access request so it can follow the regular process. 
please also check the boxes of the standard template for" [puppet] - 10https://gerrit.wikimedia.org/r/1307195 (https://phabricator.wikimedia.org/T427914) (owner: 10David Caro) [16:32:23] 10SRE-Access-Requests, 06tools-platform-team, 13Patch-For-Review: Onboard Thibaut Le Page to Wikimedia Foundation as Staff SWE in Tools Platform - https://phabricator.wikimedia.org/T427914#12082028 (10Dzahn) [16:33:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1004.eqiad.wmnet with reason: host reimage [16:33:37] (03Abandoned) 10David Caro: access: add thibaut to the wmcs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/1307195 (https://phabricator.wikimedia.org/T427914) (owner: 10David Caro) [16:34:37] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [16:35:13] !log rzl@deploy1003 Finished scap sync-world: T416623 (duration: 10m 19s) [16:35:16] T416623: Decommission NodeJS IPoid service - https://phabricator.wikimedia.org/T416623 [16:36:01] 10SRE-Access-Requests, 06tools-platform-team, 13Patch-For-Review: Onboard Thibaut Le Page to Wikimedia Foundation as Staff SWE in Tools Platform - https://phabricator.wikimedia.org/T427914#12082044 (10dcaro) [16:36:10] (03CR) 10RLazarus: [C:03+2] wmnet: Remove ipoid CNAME [dns] - 10https://gerrit.wikimedia.org/r/1306902 (https://phabricator.wikimedia.org/T416623) (owner: 10Clément Goubert) [16:36:31] !log rzl@dns1004 START - running authdns-update [16:37:08] Merge these changes? (yes/no)? y [16:37:09] Aborting merge. 
[16:37:13] 10SRE-Access-Requests, 06tools-platform-team, 13Patch-For-Review: Onboard Thibaut Le Page to Wikimedia Foundation as Staff SWE in Tools Platform - https://phabricator.wikimedia.org/T427914#12082063 (10TLepage-WMF) [16:37:17] lol I'm glad this is still unfixed *somewhere* in our infra [16:37:21] !log rzl@dns1004 START - running authdns-update [16:37:25] :D [16:37:55] (03CR) 10Pppery: "(Trying to get this merged - patches here have been blocked on i18n-check for weeks)" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1307129 (owner: 10L10n-bot) [16:38:27] (03PS1) 10SBassett: mediawiki.action.edit.preview: Fix compat with `