[00:22:40] (03PS1) 10Catrope: Enable OATHAuth passkey features in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229219 (https://phabricator.wikimedia.org/T415146) [00:23:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229219 (https://phabricator.wikimedia.org/T415146) (owner: 10Catrope) [00:34:18] (03PS1) 10Zabe: Add il_target_id to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) [00:40:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1229225 [00:40:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1229225 (owner: 10TrainBranchBot) [00:52:39] (03PS2) 10Zabe: maintain-views: Show il_target_id references in linktarget [puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) [00:53:52] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1229225 (owner: 10TrainBranchBot) [00:59:17] (03PS1) 10Jforrester: mcrouter: Allow configuring secondary replicated caches [puppet] - 10https://gerrit.wikimedia.org/r/1229229 (https://phabricator.wikimedia.org/T411807) [00:59:19] (03PS1) 10Jforrester: [DNM] memcached: Point to the replicated Wikifunctions cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229232 (https://phabricator.wikimedia.org/T411807) [00:59:20] (03PS1) 10Jforrester: [WIP] mcrouter: Configure the Wikifunctions pool as replicated [puppet] - 10https://gerrit.wikimedia.org/r/1229230 (https://phabricator.wikimedia.org/T411807) [00:59:22] (03PS1) 10Jforrester: [DNM] memcached: Drop the local-only 
Wikifunctions cache route [puppet] - 10https://gerrit.wikimedia.org/r/1229231 (https://phabricator.wikimedia.org/T411807) [01:00:58] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:03:29] !log start populating il_target_id on s3 and s6 wikis # T413668 [01:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:34] T413668: Run the data migration of imagelinks - https://phabricator.wikimedia.org/T413668 [01:10:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1229234 [01:10:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1229234 (owner: 10TrainBranchBot) [01:14:07] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 08s) [01:19:22] (03PS1) 10Zabe: Start reading from il_target_id on cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229238 (https://phabricator.wikimedia.org/T413669) [01:21:24] (03CR) 10Zabe: [C:03+2] Start reading from il_target_id on cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229238 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [01:22:13] (03Merged) 10jenkins-bot: Start reading from il_target_id on cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229238 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [01:23:21] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1229238|Start reading from il_target_id on cebwiki (T413669)]] [01:23:26] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669 [01:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:25:37] !log zabe@deploy2002 zabe: Backport for [[gerrit:1229238|Start reading from il_target_id on cebwiki (T413669)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:28:28] !log zabe@deploy2002 zabe: Continuing with sync [01:32:39] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229238|Start reading from il_target_id on cebwiki (T413669)]] (duration: 09m 18s) [01:32:44] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669 [01:33:04] (03CR) 10Zabe: [C:03+2] CommonSettings-labs: Remove redundant code for loading/configuring Phonos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225075 (owner: 10A smart kitten) [01:33:57] (03Merged) 10jenkins-bot: CommonSettings-labs: Remove redundant code for loading/configuring Phonos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225075 (owner: 10A smart kitten) [01:34:16] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1229234 (owner: 10TrainBranchBot) [01:51:23] (03PS3) 10Zabe: Cleanup manage-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226381 [01:54:13] (03PS4) 10Zabe: Cleanup manage-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226381 [02:01:28] (03CR) 10Zabe: [C:03+2] Cleanup manage-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226381 (owner: 10Zabe) [02:02:51] (03Merged) 10jenkins-bot: Cleanup manage-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226381 (owner: 10Zabe) [02:10:18] (03CR) 10Zabe: [C:03+2] Removed dropped special page from disabled query pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226962 (https://phabricator.wikimedia.org/T414202) (owner: 10Zabe) [02:11:08] (03Merged) 10jenkins-bot: Removed dropped special page from disabled query pages [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1226962 (https://phabricator.wikimedia.org/T414202) (owner: 10Zabe) [02:12:06] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226962|Removed dropped special page from disabled query pages (T414202)]] [02:12:11] T414202: Disable GloballyUnusedFiles special page on commons - https://phabricator.wikimedia.org/T414202 [02:14:18] !log zabe@deploy2002 zabe: Backport for [[gerrit:1226962|Removed dropped special page from disabled query pages (T414202)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [02:15:03] !log zabe@deploy2002 zabe: Continuing with sync [02:19:09] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226962|Removed dropped special page from disabled query pages (T414202)]] (duration: 07m 03s) [02:19:14] T414202: Disable GloballyUnusedFiles special page on commons - https://phabricator.wikimedia.org/T414202 [02:34:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11540134 (10Jclark-ctr) Dell sr got rejected. Resubmitted will be additional day for delivery [02:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:34:51] (03PS1) 10Tim Starling: Remove unused LoginNotify config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229252 (https://phabricator.wikimedia.org/T412939) [03:03:00] (03CR) 10Ori: [C:03+2] "Beta-only changed. Tested manually on Beta." 
[puppet] - 10https://gerrit.wikimedia.org/r/1219190 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [03:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:39:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:09:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87800 and previous config saved to /var/cache/conftool/dbconfig/20260121-040949-marostegui.json [04:09:56] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:09:58] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:19:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P87801 and previous config saved to /var/cache/conftool/dbconfig/20260121-041958-marostegui.json [04:30:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P87802 and previous config saved to /var/cache/conftool/dbconfig/20260121-043006-marostegui.json [04:40:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87803 and previous config saved to /var/cache/conftool/dbconfig/20260121-044015-marostegui.json [04:40:22] T411163: Drop ar_sha1 from archive table in wmf 
production - https://phabricator.wikimedia.org/T411163 [04:40:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:40:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [04:40:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87804 and previous config saved to /var/cache/conftool/dbconfig/20260121-044039-marostegui.json [04:48:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87805 and previous config saved to /var/cache/conftool/dbconfig/20260121-044828-marostegui.json [04:48:36] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:48:36] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:53:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87806 and previous config saved to /var/cache/conftool/dbconfig/20260121-045342-marostegui.json [04:53:50] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:53:51] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:58:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P87807 and previous config saved to /var/cache/conftool/dbconfig/20260121-045837-marostegui.json [05:03:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P87808 and previous config saved to 
/var/cache/conftool/dbconfig/20260121-050351-marostegui.json [05:08:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P87809 and previous config saved to /var/cache/conftool/dbconfig/20260121-050845-marostegui.json [05:09:05] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229279 [05:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P87810 and previous config saved to /var/cache/conftool/dbconfig/20260121-051359-marostegui.json [05:18:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87811 and previous config saved to /var/cache/conftool/dbconfig/20260121-051854-marostegui.json [05:19:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:19:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:19:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:24:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87812 and previous config saved to /var/cache/conftool/dbconfig/20260121-052408-marostegui.json [05:24:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:24:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:24:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:24:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [05:24:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87813 and previous config saved to /var/cache/conftool/dbconfig/20260121-052432-marostegui.json [05:34:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:37:37] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229434 [05:53:46] (03CR) 10Marostegui: [C:03+2] Revert "dbproxy1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1229085 (owner: 10Marostegui) [05:55:21] (03PS1) 10Marostegui: wmnet: Switch m1-master [dns] - 10https://gerrit.wikimedia.org/r/1229439 (https://phabricator.wikimedia.org/T414656) [05:56:19] (03PS1) 10Marostegui: Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1229440 [05:56:25] (03PS1) 10Marostegui: Revert "db2179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1229441 [05:56:34] (03CR) 10Marostegui: [C:03+2] wmnet: Switch m1-master [dns] - 10https://gerrit.wikimedia.org/r/1229439 
(https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [05:56:38] !log marostegui@dns1006 START - running authdns-update [05:57:08] !log Promote dbproxy1024 to m1-master T414656 [05:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:13] T414656: Migrate dbproxy* to Debian Trixie - https://phabricator.wikimedia.org/T414656 [05:57:47] !log marostegui@dns1006 END - running authdns-update [05:59:21] (03CR) 10Marostegui: [C:03+2] Revert "db2179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1229441 (owner: 10Marostegui) [05:59:29] (03CR) 10Marostegui: [C:03+2] Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1229440 (owner: 10Marostegui) [06:00:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1160: After schema change [06:01:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2179: After schema change [06:01:31] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.newpool (exit_code=97) pool db1160: After schema change [06:01:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1160: After schema change [06:01:49] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.newpool (exit_code=97) pool db2179: After schema change [06:01:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2179: After schema change [06:16:49] (03PS1) 10Samwilson: Revert "jquery.wikiEditor: enable resizing drag bar without RTP" [extensions/WikiEditor] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229445 [06:17:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T410589)', diff saved to https://phabricator.wikimedia.org/P87818 and previous config saved to /var/cache/conftool/dbconfig/20260121-061752-ladsgroup.json [06:17:58] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:21:59] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by samwilson@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229445 (owner: 10Samwilson) [06:28:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P87819 and previous config saved to /var/cache/conftool/dbconfig/20260121-062801-ladsgroup.json [06:30:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11540335 (10Marostegui) [06:33:53] (03Merged) 10jenkins-bot: Revert "jquery.wikiEditor: enable resizing drag bar without RTP" [extensions/WikiEditor] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229445 (owner: 10Samwilson) [06:34:38] !log samwilson@deploy2002 Started scap sync-world: Backport for [[gerrit:1229445|Revert "jquery.wikiEditor: enable resizing drag bar without RTP"]] [06:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:37:00] !log samwilson@deploy2002 samwilson: Backport for [[gerrit:1229445|Revert "jquery.wikiEditor: enable resizing drag bar without RTP"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[06:38:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P87822 and previous config saved to /var/cache/conftool/dbconfig/20260121-063809-ladsgroup.json [06:40:10] !log samwilson@deploy2002 samwilson: Continuing with sync [06:44:16] !log samwilson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229445|Revert "jquery.wikiEditor: enable resizing drag bar without RTP"]] (duration: 09m 37s) [06:47:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db1160: After schema change [06:47:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2179: After schema change [06:48:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T410589)', diff saved to https://phabricator.wikimedia.org/P87825 and previous config saved to /var/cache/conftool/dbconfig/20260121-064817-ladsgroup.json [06:48:23] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:48:34] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [06:53:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87826 and previous config saved to /var/cache/conftool/dbconfig/20260121-065357-marostegui.json [06:54:04] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [06:54:04] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T0700) [07:04:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to 
https://phabricator.wikimedia.org/P87827 and previous config saved to /var/cache/conftool/dbconfig/20260121-070405-marostegui.json [07:09:56] (03CR) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [07:14:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P87828 and previous config saved to /var/cache/conftool/dbconfig/20260121-071414-marostegui.json [07:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:19:39] (03PS15) 10Daniel Kinzler: rest gateway: add tests for chart rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 [07:19:43] (03CR) 10Daniel Kinzler: rest gateway: add tests for chart rendering (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 (owner: 10Daniel Kinzler) [07:24:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87829 and previous config saved to /var/cache/conftool/dbconfig/20260121-072422-marostegui.json [07:24:29] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [07:24:29] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [07:24:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [07:24:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T411163 T411164)', diff 
saved to https://phabricator.wikimedia.org/P87830 and previous config saved to /var/cache/conftool/dbconfig/20260121-072446-marostegui.json [07:24:50] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:26:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:34:45] (03CR) 10Muehlenhoff: admin: Add johannnes89 to LDAP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) (owner: 10Federico Ceratto) [07:39:59] (03CR) 10Ryan Kemper: [C:03+1] data-platform: Show affected DC on blackbox alerts [alerts] - 10https://gerrit.wikimedia.org/r/1229203 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [07:53:44] (03PS1) 10Bartosz Wójtowicz: ml-services: Lower resource limits for article descriptions. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229477 (https://phabricator.wikimedia.org/T414431) [07:55:12] (03PS1) 10Bartosz Wójtowicz: ml-services: Update image for Article Topic model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229491 (https://phabricator.wikimedia.org/T414573) [08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T0800). nyaa~ [08:00:05] Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:09] o/ [08:04:06] Anyone available for a quick deploy? 
[08:07:02] !log a-pizzata@deploy2002 Started deploy [analytics/refinery@4f6560f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4f6560f9] [08:08:07] !log a-pizzata@deploy2002 Finished deploy [analytics/refinery@4f6560f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4f6560f9] (duration: 01m 05s) [08:08:58] !log a-pizzata@deploy2002 Started deploy [analytics/refinery@4f6560f]: Regular analytics weekly train [analytics/refinery@4f6560f9] [08:11:40] !log a-pizzata@deploy2002 Finished deploy [analytics/refinery@4f6560f]: Regular analytics weekly train [analytics/refinery@4f6560f9] (duration: 02m 43s) [08:17:46] (03PS1) 10Muehlenhoff: Record LDAP access for bjiangwmf [puppet] - 10https://gerrit.wikimedia.org/r/1229512 [08:18:09] !log a-pizzata@deploy2002 Started deploy [analytics/refinery@4f6560f] (thin): Regular analytics weekly train THIN [analytics/refinery@4f6560f9] [08:19:26] !log a-pizzata@deploy2002 Finished deploy [analytics/refinery@4f6560f] (thin): Regular analytics weekly train THIN [analytics/refinery@4f6560f9] (duration: 01m 16s) [08:20:26] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for bjiangwmf [puppet] - 10https://gerrit.wikimedia.org/r/1229512 (owner: 10Muehlenhoff) [08:34:45] (03PS1) 10Joal: Update services_proxy/envoy.yaml for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/1229517 (https://phabricator.wikimedia.org/T411989) [08:35:30] Well Seems no one is available [08:35:35] Will try to re-schedule my patches [08:35:53] (03PS1) 10Slyngshede: java: create openjdk-21 image (JDK) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) [08:37:22] (03CR) 10Muehlenhoff: java: create openjdk-21 image (JDK) (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [08:40:56] (03PS2) 10Slyngshede: java: create openjdk-21 image (JDK) 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) [08:42:49] (03PS3) 10Slyngshede: java: create openjdk-21 image (JDK) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) [08:44:10] (03CR) 10Slyngshede: java: create openjdk-21 image (JDK) (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [08:48:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [08:51:42] (03PS1) 10Muehlenhoff: Mark the ops group as require_fido: true [puppet] - 10https://gerrit.wikimedia.org/r/1229520 [08:52:26] (03CR) 10CI reject: [V:04-1] Mark the ops group as require_fido: true [puppet] - 10https://gerrit.wikimedia.org/r/1229520 (owner: 10Muehlenhoff) [08:55:37] 06SRE, 10MediaWiki-Uploading, 06ServiceOps new, 10ServiceOps-Mediawiki: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#11540440 (10MLechvien-WMF) Apologies for the late follow up. @Grand-Duc Do you still experience the issue here? [08:57:11] (03PS2) 10Muehlenhoff: Mark the ops group as require_fido: true [puppet] - 10https://gerrit.wikimedia.org/r/1229520 [08:59:09] (03CR) 10Brouberol: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1229517 (https://phabricator.wikimedia.org/T411989) (owner: 10Joal) [09:00:04] andre and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T0900). 
[09:00:09] o/ [09:02:44] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229522 (https://phabricator.wikimedia.org/T413803) [09:02:48] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229522 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [09:03:37] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229522 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [09:07:11] (03PS3) 10Vgutierrez: cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1228571 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:07:18] (03PS4) 10Vgutierrez: cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1228571 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:08:05] (03PS5) 10Vgutierrez: cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1228571 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:08:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228571 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:09:46] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.12 refs T413803 [09:09:51] T413803: 1.46.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T413803 [09:22:17] (03CR) 10Muehlenhoff: [C:03+1] "If the key has been validated out of band, good to go" [puppet] - 10https://gerrit.wikimedia.org/r/1229195 (https://phabricator.wikimedia.org/T414830) (owner: 10Federico Ceratto) [09:24:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:25:24] (03CR) 10Vgutierrez: [C:03+2] cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1228571 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:27:53] (03CR) 10Slyngshede: [V:03+2] "Built successfully locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [09:27:55] (03CR) 10Slyngshede: [V:03+2 C:03+2] java: create openjdk-21 image (JDK) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1229519 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [09:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:35] (03CR) 10Elukey: [C:03+2] services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [09:37:03] (03PS1) 10Joal: Update dse-k8s-eqiad airflow values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229524 (https://phabricator.wikimedia.org/T411989) [09:38:43] (03CR) 10CI reject: [V:04-1] Update dse-k8s-eqiad airflow values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229524 (https://phabricator.wikimedia.org/T411989) (owner: 10Joal) [09:39:44] (03PS3) 10Vgutierrez: cache::upload: enable global ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1228572 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:40:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1228572 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:44:40] (03CR) 10Elukey: [C:03+1] Mark the ops group as require_fido: true [puppet] - 10https://gerrit.wikimedia.org/r/1229520 (owner: 10Muehlenhoff) [09:47:31] (03CR) 10Vgutierrez: [C:03+2] cache::upload: enable global ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1228572 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:56:27] (03CR) 10Jgiannelos: mobileapps: Set limits on memory usage to avoid latency increase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [09:57:56] (03CR) 10Muehlenhoff: [C:03+2] Remove Puppet 5 volatile directory from backups [puppet] - 10https://gerrit.wikimedia.org/r/1229110 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:58:22] vgutierrez: ok to puppet-merge your ratelimit patch along? [09:58:29] wait... [09:58:34] (03CR) 10Brouberol: [C:03+2] Update services_proxy/envoy.yaml for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/1229517 (https://phabricator.wikimedia.org/T411989) (owner: 10Joal) [09:58:35] crap [09:58:35] sure [09:58:37] yes [09:58:43] /o\ [09:58:46] sorry about that [09:58:50] (03PS4) 10Jgiannelos: mobileapps: Set limits on memory usage to avoid latency increase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) [09:58:51] np at all! 
[09:59:48] it's now merged [10:00:18] thx :D [10:00:28] rerunning puppet on A:cp-upload_codfw :_) [10:03:27] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::updatenetboot [puppet] - 10https://gerrit.wikimedia.org/r/1229107 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:05:03] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1229526 (https://phabricator.wikimedia.org/T415171) [10:05:11] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1229527 (https://phabricator.wikimedia.org/T415171) [10:06:43] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T415172 (10GGalofre-WMF) 03NEW [10:10:23] (03PS3) 10Vgutierrez: cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1228573 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:10:29] (03PS4) 10Vgutierrez: cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1228573 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:10:50] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228573 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:12:55] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [10:13:01] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [10:18:15] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11540605 (10Novem_Linguae) [10:18:20] !log installing setuptools security updates [10:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:49] (03PS2) 10FNegri: cloudnfs: Add wikiqlever project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/1229159 
(https://phabricator.wikimedia.org/T414986) [10:19:50] (03CR) 10FNegri: cloudnfs: Add wikiqlever project to dumps mounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229159 (https://phabricator.wikimedia.org/T414986) (owner: 10FNegri) [10:20:52] (03CR) 10Vgutierrez: [C:03+2] cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1228573 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:25:04] (03PS5) 10Blake: sre.switchdc.mediawiki: Automate scap lock/unlock [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) [10:26:06] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528#11540620 (10elukey) ` elukey@deploy2002:~$ curl -i "https://kartotherian.k8s-staging.discovery.wmnet:30443/img/osm-intl,12,31.807,34.673,400x400.png?lang=he&domain=he.wikipedia.org&title=%D... [10:26:24] (03CR) 10Blake: "Other cookbooks (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/deploy/hiddenparma." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [10:26:51] (03CR) 10Cathal Mooney: [C:03+1] Mark the ops group as require_fido: true [puppet] - 10https://gerrit.wikimedia.org/r/1229520 (owner: 10Muehlenhoff) [10:28:22] (03PS6) 10Blake: sre.switchdc.mediawiki: Automate scap lock/unlock [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) [10:33:30] PROBLEM - MariaDB Replica Lag: s1 on dbstore1008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:34:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: long schema change [10:34:35] (03CR) 10Joal: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229524 (https://phabricator.wikimedia.org/T411989) (owner: 10Joal) [10:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:35:38] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11540661 (10MatthewVernon) There is currently 3T of apus quota allocated to the docker-registry user cf. [[... 
[10:43:16] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [10:44:24] (03PS3) 10Vgutierrez: cache::upload: enable global ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1228574 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:44:29] (03PS4) 10Vgutierrez: cache::upload: enable global ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1228574 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:44:49] (03CR) 10Muehlenhoff: [C:03+2] Mark the ops group as require_fido: true [puppet] - 10https://gerrit.wikimedia.org/r/1229520 (owner: 10Muehlenhoff) [10:46:41] (03CR) 10Majavah: Mark the ops group as require_fido: true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229520 (owner: 10Muehlenhoff) [10:47:39] (03PS1) 10Elukey: role::ml_builder: add docker engine settings [puppet] - 10https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) [10:47:53] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228574 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:49:17] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1032.eqiad.wmnet with OS trixie [10:49:38] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) (owner: 10Elukey) [10:49:40] (03PS1) 10Pmiazga: noop: Improve APIGW/RESTGW README file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229534 [10:52:41] (03CR) 10Muehlenhoff: [C:03+2] Mark the ops group as require_fido: true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229520 (owner: 10Muehlenhoff) [10:57:45] (03PS1) 10Majavah: admin: Enforce FIDO key requirement as a test [puppet] - 10https://gerrit.wikimedia.org/r/1229536 [10:58:29] (03CR) 10CI reject: [V:04-1] admin: Enforce FIDO key requirement as a test [puppet] - 
10https://gerrit.wikimedia.org/r/1229536 (owner: 10Majavah) [10:59:10] (03CR) 10Brouberol: [C:03+1] Update dse-k8s-eqiad airflow values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229524 (https://phabricator.wikimedia.org/T411989) (owner: 10Joal) [10:59:41] (03CR) 10Pmiazga: [C:03+1] "Everything looks good, tested locally. Although I have a question:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1100) [11:00:11] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11540714 (10elukey) >>! In T412951#11539601, @Scott_French wrote: > Thank you very much @elukey - that's gr... [11:03:00] (03CR) 10FNegri: maintain-views: Show il_target_id references in linktarget (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [11:03:22] (03CR) 10FNegri: maintain-views: Show il_target_id references in linktarget (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [11:07:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11540739 (10MoritzMuehlenhoff) [11:11:33] (03CR) 10Ladsgroup: [C:03+1] "if you could deploy it, I'd appreciate it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [11:11:54] (03PS1) 10Muehlenhoff: Remove obsolete comment [puppet] - 10https://gerrit.wikimedia.org/r/1229537 [11:14:19] !log installing curl bugfix updates from Bookworm point release [11:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:47] FIRING: DruidWebrequestSampledNoEvents: Zero webrequest_sampled events received by druid_analytics over the last 30 minutes. ... [11:15:47] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_webrequest_sampled_live_Supervisor - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1&var-druid_datasource=webrequest_sampled_live - https://alerts.wikimedia.org/?q=alertname%3DDruidWebrequestSampledNoEvents [11:16:44] that's expected :) [11:17:10] (03CR) 10Daniel Kinzler: "If x-client-ip is not set, this means "this request is internal, no rate limiting should apply". 
Since in that case rate limiting is entir" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [11:17:27] (03CR) 10Majavah: [C:03+1] Remove obsolete comment [puppet] - 10https://gerrit.wikimedia.org/r/1229537 (owner: 10Muehlenhoff) [11:18:28] (03PS1) 10MVernon: Swift: remove 3 drained codfw hosts for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1229538 (https://phabricator.wikimedia.org/T354872) [11:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:20:16] (03CR) 10Marostegui: [C:03+1] Swift: remove 3 drained codfw hosts for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1229538 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [11:21:26] (03CR) 10Ladsgroup: [C:03+1] Swift: remove 3 drained codfw hosts for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1229538 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [11:22:07] (03PS1) 10Btullis: Upgrade druid to version 26.0.0 on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1229539 (https://phabricator.wikimedia.org/T278056) [11:22:52] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7923/co" [puppet] - 10https://gerrit.wikimedia.org/r/1229539 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [11:23:49] (03CR) 10MVernon: [C:03+2] Swift: remove 3 drained codfw hosts for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1229538 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [11:26:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1003.eqiad.wmnet with OS bookworm [11:27:34] !log 
btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1004.eqiad.wmnet with OS bookworm [11:27:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1005.eqiad.wmnet with OS bookworm [11:28:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1006.eqiad.wmnet with OS bookworm [11:28:16] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1007.eqiad.wmnet with OS bookworm [11:28:58] (03CR) 10Joal: "One nit" [puppet] - 10https://gerrit.wikimedia.org/r/1229539 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [11:29:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2074.codfw.wmnet with OS bullseye [11:29:27] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11540787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2074.codfw.wm... 
[11:29:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2074 [11:30:21] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:30:38] (03CR) 10Daniel Kinzler: noop: Improve APIGW/RESTGW README file (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229534 (owner: 10Pmiazga) [11:32:06] (03PS2) 10Btullis: Upgrade druid to version 26.0.0 on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1229539 (https://phabricator.wikimedia.org/T278056) [11:32:56] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7924/co" [puppet] - 10https://gerrit.wikimedia.org/r/1229539 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [11:33:03] (03CR) 10Btullis: Upgrade druid to version 26.0.0 on the analytics cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1229539 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [11:33:17] (03CR) 10Daniel Kinzler: noop: Improve APIGW/RESTGW README file (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229534 (owner: 10Pmiazga) [11:34:32] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2074 - mvernon@cumin2002" [11:34:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2074 - mvernon@cumin2002" [11:34:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:34:38] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2074.codfw.wmnet 137.0.192.10.in-addr.arpa 7.3.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:34:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
ms-be2074.codfw.wmnet 137.0.192.10.in-addr.arpa 7.3.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:34:43] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2074 [11:35:04] (03CR) 10Btullis: [C:03+2] Upgrade druid to version 26.0.0 on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1229539 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [11:37:46] mvernon@cumin2002 reimage (PID 3366813) is awaiting input [11:38:15] jouncebot: nowandnext [11:38:15] For the next 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1100) [11:38:16] In 0 hour(s) and 21 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1200) [11:39:37] Anyone mind if I use scap to deploy a no-op config change? [11:39:47] (03CR) 10Dreamy Jazz: [C:03+1] Remove unused LoginNotify config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229252 (https://phabricator.wikimedia.org/T412939) (owner: 10Tim Starling) [11:40:13] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1006.eqiad.wmnet with reason: host reimage [11:40:47] (03CR) 10Dreamy Jazz: [C:03+1] "The config values are already set to this value since 8afd8f865c5287a46e174ca3b41a79bee4e063d6 (merged in April 2024)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229252 (https://phabricator.wikimedia.org/T412939) (owner: 10Tim Starling) [11:41:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229252 (https://phabricator.wikimedia.org/T412939) (owner: 10Tim Starling) [11:41:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11540839 (10Arrbee) This is an approved 
request for @GGalofre-WMF [11:42:37] (03Merged) 10jenkins-bot: Remove unused LoginNotify config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229252 (https://phabricator.wikimedia.org/T412939) (owner: 10Tim Starling) [11:43:06] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1229252|Remove unused LoginNotify config (T412939)]] [11:43:10] T412939: Drop support for reading CheckUser tables from LoginNotify - https://phabricator.wikimedia.org/T412939 [11:44:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2074 [11:44:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2074 [11:44:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1006.eqiad.wmnet with reason: host reimage [11:45:19] !log dreamyjazz@deploy2002 dreamyjazz, tstarling: Backport for [[gerrit:1229252|Remove unused LoginNotify config (T412939)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:45:35] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1005.eqiad.wmnet with reason: host reimage [11:46:24] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-druid1007.eqiad.wmnet with OS bookworm [11:46:57] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1007.eqiad.wmnet with OS bookworm [11:47:36] !log dreamyjazz@deploy2002 dreamyjazz, tstarling: Continuing with sync [11:49:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1005.eqiad.wmnet with reason: host reimage [11:50:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2076.codfw.wmnet with OS bullseye [11:50:55] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [11:51:00] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11540850 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2076.codfw.wmnet with OS bullseye [11:51:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2076 [11:51:26] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:51:45] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229252|Remove unused LoginNotify config (T412939)]] (duration: 08m 39s) [11:51:45] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229279 (owner: 10PipelineBot) [11:51:50] T412939: Drop support for reading CheckUser tables from LoginNotify - https://phabricator.wikimedia.org/T412939 [11:51:59] I'm done with my config patch deploy [11:52:16] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage [11:53:33] 
(03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229279 (owner: 10PipelineBot) [11:55:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [11:55:22] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2076 - mvernon@cumin2002" [11:55:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2076 - mvernon@cumin2002" [11:55:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:55:28] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2076.codfw.wmnet 248.16.192.10.in-addr.arpa 8.4.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:55:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2076.codfw.wmnet 248.16.192.10.in-addr.arpa 8.4.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:55:33] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2076 [11:55:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2076 [11:55:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2076 [11:58:20] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1007.eqiad.wmnet with reason: host reimage [11:58:29] !log Run of medium.dblist for T413868 has completed [11:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:34] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - 
https://phabricator.wikimedia.org/T413868 [11:58:42] !log Stopped run of script for T413868 on large.dblist [11:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage [11:59:43] !log Running `foreachwikiindblist s4.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 3000 --sleep 2` for T413868 [11:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1200). [12:01:08] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:01:41] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:03:34] !log Running `foreachwikiindblist s1.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 10000 --sleep 5` for T413868 [12:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1007.eqiad.wmnet with reason: host reimage [12:03:39] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [12:04:22] !log Running `foreachwikiindblist s1.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 5000 --sleep 2` for T413868 [12:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:29] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Set limits on memory usage to avoid latency increase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [12:04:37] !log 
btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1006.eqiad.wmnet with OS bookworm [12:04:49] (03CR) 10Gkyziridis: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229477 (https://phabricator.wikimedia.org/T414431) (owner: 10Bartosz Wójtowicz) [12:05:01] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229491 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [12:06:28] (03Merged) 10jenkins-bot: mobileapps: Set limits on memory usage to avoid latency increase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [12:09:28] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete comment [puppet] - 10https://gerrit.wikimedia.org/r/1229537 (owner: 10Muehlenhoff) [12:09:56] !log Running `foreachwikiindblist s2.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 5000 --sleep 1` for T413868 [12:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:01] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [12:10:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1005.eqiad.wmnet with OS bookworm [12:10:58] !log Running `foreachwikiindblist s3.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 5000 --sleep 1` for T413868 [12:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:17] RESOLVED: DruidWebrequestSampledNoEvents: Zero webrequest_sampled events received by druid_analytics over the last 30 minutes. ... 
[12:12:17] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_webrequest_sampled_live_Supervisor - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1&var-druid_datasource=webrequest_sampled_live - https://alerts.wikimedia.org/?q=alertname%3DDruidWebrequestSampledNoEvents [12:12:48] (03PS1) 10Jgiannelos: mobileapps: Disable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229554 [12:15:20] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:15:46] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:15:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1003.eqiad.wmnet with OS bookworm [12:16:08] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:16:37] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:18:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1004.eqiad.wmnet with OS bookworm [12:20:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1007.eqiad.wmnet with OS bookworm [12:28:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:30:19] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Lower resource limits for article descriptions. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229477 (https://phabricator.wikimedia.org/T414431) (owner: 10Bartosz Wójtowicz) [12:30:25] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update image for Article Topic model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1229491 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [12:32:09] (03Merged) 10jenkins-bot: ml-services: Lower resource limits for article descriptions. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229477 (https://phabricator.wikimedia.org/T414431) (owner: 10Bartosz Wójtowicz) [12:32:14] (03Merged) 10jenkins-bot: ml-services: Update image for Article Topic model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229491 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [12:33:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:41:26] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2074.codfw.wmnet with OS bullseye [12:41:32] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2076.codfw.wmnet with OS bullseye [12:41:35] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11541010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2074.codfw.wmnet with OS bullseye execu... [12:41:42] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11541011 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2076.codfw.wmnet with OS bullseye execu... 
[12:42:11] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189 (10MatthewVernon) 03NEW [12:42:45] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11541028 (10MatthewVernon) [12:42:48] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11541029 (10MatthewVernon) [12:42:52] (03CR) 10Kamila Součková: "Quick first pass so you get these ASAP, I'll do a 2nd pass." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [12:43:02] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11541030 (10MatthewVernon) p:05Triage→03High [priority because I do need to be able to reliably reimage swift nodes] [12:51:31] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218293 (owner: 10PipelineBot) [12:51:42] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224693 (owner: 10PipelineBot) [12:51:52] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226195 (owner: 10PipelineBot) [12:58:11] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [13:02:09] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [13:05:51] I have a patch to deploy. Would anyone mind if I do it now? 
[13:07:04] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:08:39] !log Deploying mitigation for T414547 [13:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:29] !log mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 skin|thumbsize [13:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:13] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:14:58] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:15:26] !log mszwarc Deployed security patch for T414547 [13:18:34] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:21:26] !log mszwarc Deployed security patch for T414547 [13:21:48] (03PS1) 10Bartosz Wójtowicz: ml-services: Add replicas for article topic model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1229560 (https://phabricator.wikimedia.org/T414573) [13:22:02] !log Finished deploying for T414547 [13:22:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:25:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [13:26:58] (03CR) 10Subramanya Sastry: [C:03+2] mobileapps: Disable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229554 (owner: 10Jgiannelos) [13:28:05] (03CR) 10Kamila Součková: "another partial review round" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [13:28:10] (03CR) 10Dpogorzelski: [C:03+1] role::ml_builder: add docker engine settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) (owner: 10Elukey) [13:28:46] (03Merged) 10jenkins-bot: mobileapps: Disable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229554 (owner: 10Jgiannelos) [13:30:22] (03PS1) 10Cathal Mooney: Add Nokia BGP routing policy for wikikube-worker / k8s hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1229562 (https://phabricator.wikimedia.org/T408757) [13:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:28] (03CR) 10Slyngshede: "This can also be "fixed" by wrapping the maxminddb.open with pcall, but in my testing this confuses the code quite a bit." [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111) (owner: 10Slyngshede) [13:40:44] !log Running `foreachwikiindblist s5.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 5000 --sleep 1` for T413868 [13:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:49] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [13:41:23] (03PS2) 10Cathal Mooney: Add Nokia BGP routing policy for wikikube-worker / k8s hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1229562 (https://phabricator.wikimedia.org/T408757) [13:41:30] !log Run of script for T413868 has finished on s2 [13:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:47] (03CR) 10Gkyziridis: [C:03+1] "LGTM thnx" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229560 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [13:42:04] !log Run of script for T413868 has finished on s3 [13:42:06] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Add replicas for article topic model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1229560 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [13:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] (03CR) 10Vgutierrez: [C:03+2] cache::upload: enable global ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1228574 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:42:54] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [13:43:36] !log Running `foreachwikiindblist s6.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 5000 --sleep 1` for T413868 [13:43:40] (03Merged) 10jenkins-bot: ml-services: Add replicas for article topic model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229560 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [13:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:17] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:44:48] !log Running `foreachwikiindblist s7.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 5000 --sleep 1` for T413868
[13:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:43] !log Running `foreachwikiindblist s8.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 3000 --sleep 2` for T413868
[13:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:04] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:46:52] (PS2) Dpogorzelski: role::ml_builder: add docker engine settings [puppet] - https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) (owner: Elukey)
[13:47:08] (CR) Dpogorzelski: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) (owner: Elukey)
[13:47:36] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[13:48:28] !log installing qemu security updates [13:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:25] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 3 others: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet - https://phabricator.wikimedia.org/T409102#11541221 (10Blake) [13:50:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [13:51:01] 06SRE, 06Infrastructure-Foundations, 10Mail, 06ServiceOps new: Sendmail network error (deployment) - https://phabricator.wikimedia.org/T407723#11541239 (10Blake) [13:52:02] (03PS3) 10Dpogorzelski: role::ml_builder: add docker engine settings [puppet] - 10https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) (owner: 10Elukey) [13:52:12] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) (owner: 10Elukey) [13:55:32] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:56:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [13:57:01] (03PS3) 10Vgutierrez: cache::upload: enable global ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1228575 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:59:31] (03PS2) 10Jforrester: wikifunctions: Update check-wf-services.sh to also check the v2 endpoint for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229210 (https://phabricator.wikimedia.org/T414589) [13:59:31] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-01-07-132938 to 2026-01-15-194836 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229568 (https://phabricator.wikimedia.org/T394557) [13:59:33] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-01-07-163903 to 
2026-01-21-135031 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229569 (https://phabricator.wikimedia.org/T413728) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1400) [14:00:05] JSherman: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] o/ [14:00:17] o/ [14:00:36] JSherman: want to deploy your change yourself? [14:00:50] yep, I'll get it rolling! [14:01:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [extensions/PageTriage] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229172 (https://phabricator.wikimedia.org/T414892) (owner: 10Jsn.sherman) [14:01:21] (03PS1) 10Majavah: P:dumps::distribution: Only use IP addresses in rsync config [puppet] - 10https://gerrit.wikimedia.org/r/1229570 [14:02:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [14:02:19] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7925/co" [puppet] - 10https://gerrit.wikimedia.org/r/1229570 (owner: 10Majavah) [14:04:45] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228575 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:06:16] (03PS1) 10Jgiannelos: mobileapps: Use max-old-space-size instead [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 [14:06:30] (03CR) 10CI reject: [V:04-1] mobileapps: Use max-old-space-size instead [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 (owner: 10Jgiannelos) [14:06:42] (03PS17) 10Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) [14:07:03] (03CR) 10Federico Ceratto: [C:03+2] admin: update astein SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1229195 (https://phabricator.wikimedia.org/T414830) (owner: 10Federico Ceratto) [14:07:08] (03PS2) 10Jgiannelos: mobileapps: Use max-old-space-size instead [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 [14:07:45] (03CR) 10Jgiannelos: "It looks like docs were a bit misleading. The percentage version of the flag is available in node22 but only in minor version greater than" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 (owner: 10Jgiannelos) [14:08:22] 06SRE, 13Patch-For-Review: please update astein puppet ssh key - https://phabricator.wikimedia.org/T414830#11541309 (10FCeratto-WMF) 05Open→03Resolved Change deployed on Puppet, closing task. [14:08:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1032.eqiad.wmnet with OS trixie [14:08:43] (03CR) 10CI reject: [V:04-1] charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [14:08:47] (03PS1) 10Muehlenhoff: Remove documentation about the legacy way to configure TLS for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1229573 (https://phabricator.wikimedia.org/T357750) [14:10:04] (03CR) 10Vgutierrez: [C:03+2] cache::upload: enable global ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1228575 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:13:20] (03Merged) 10jenkins-bot: Use page creation date as offset only if the legacy ordering is specified [extensions/PageTriage] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229172 (https://phabricator.wikimedia.org/T414892) (owner: 10Jsn.sherman) [14:13:27] (03PS2) 10Federico Ceratto: admin: Add johannnes89 to LDAP [puppet] - 
10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) [14:13:40] (03PS1) 10Majavah: P:dumps::distribution: Manage firewall allowlists like rsync allowlists [puppet] - 10https://gerrit.wikimedia.org/r/1229574 [14:13:57] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1229172|Use page creation date as offset only if the legacy ordering is specified (T414892)]] [14:14:02] T414892: New Pages Feed Rollover - https://phabricator.wikimedia.org/T414892 [14:14:14] (03PS3) 10Federico Ceratto: admin: Add johannnes89 to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) [14:14:30] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7927/co" [puppet] - 10https://gerrit.wikimedia.org/r/1229574 (owner: 10Majavah) [14:14:50] (03CR) 10Federico Ceratto: "Updated as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) (owner: 10Federico Ceratto) [14:16:06] (03PS18) 10Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) [14:16:10] !log jsn@deploy2002 jsn: Backport for [[gerrit:1229172|Use page creation date as offset only if the legacy ordering is specified (T414892)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:16:14] (03CR) 10Daniel Kinzler: charts: add redioscope chart and service (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [14:16:37] testing [14:17:31] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11541337 (10FCeratto-WMF) 05Open→03Resolved I'm closing the task as resolved for now. @dr0ptp4kt if there's any issue please reopen the task. [14:18:07] (03CR) 10Daniel Kinzler: charts: add redioscope chart and service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [14:18:31] (03CR) 10Andrew Bogott: [C:03+1] cloudnfs: Add wikiqlever project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/1229159 (https://phabricator.wikimedia.org/T414986) (owner: 10FNegri) [14:18:46] (03CR) 10Bking: [C:03+2] data-platform: Show affected DC on blackbox alerts [alerts] - 10https://gerrit.wikimedia.org/r/1229203 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [14:19:30] (03PS19) 10Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) [14:20:25] (03CR) 10Elukey: [C:03+2] role::ml_builder: add docker engine settings [puppet] - 10https://gerrit.wikimedia.org/r/1229531 (https://phabricator.wikimedia.org/T385173) (owner: 10Elukey) [14:20:51] (03CR) 10Muehlenhoff: "Looks good, but let's add an comment to help understand this in the future" [puppet] - 10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) (owner: 10Federico Ceratto) [14:21:47] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog1003.eqiad.wmnet with OS bookworm [14:22:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install 
mwlog1003 - https://phabricator.wikimedia.org/T412230#11541354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm [14:22:13] (03CR) 10Daniel Kinzler: charts: add redioscope chart and service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [14:23:35] JSherman: Appears to work I think (on k8s-mwdebug)! [14:24:04] I'm testing on testwiki since it has the latest branch [14:24:14] I'm doing a few runs [14:24:42] Ah got it! [14:24:52] I was able to reproduce it on testwiki with the userscript [14:25:13] but intermittently, so I'm just making sure I get a few runs with no dups [14:25:23] (while on the debug host) [14:25:57] 👍 [14:27:15] !log jsn@deploy2002 jsn: Continuing with sync [14:27:33] Sohom_Datta: I think we got it [14:27:42] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11541385 (10MoritzMuehlenhoff) [14:28:07] debug off had 2/5 runs with dups, debug on had 0/5 runs with dups [14:28:41] not conclusive, but I didn't see any regressions, either [14:28:58] (03CR) 10Kamila Součková: [C:03+2] charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [14:30:17] Sohom_Datta: I don't think you'll see this on enwiki until 1.46.0-wmf.12 rolls to group 2 [14:30:58] Huh... Do you want me to cherrypick it to .11 ? 
[14:31:00] (03Merged) 10jenkins-bot: charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [14:31:13] Or do we wait until .12 [14:31:22] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229172|Use page creation date as offset only if the legacy ordering is specified (T414892)]] (duration: 17m 24s) [14:31:27] T414892: New Pages Feed Rollover - https://phabricator.wikimedia.org/T414892 [14:32:35] Sohom_Datta: yeah, I just thought we would wait until it rolls tomorrow. I wasn't sure how this would act on .11 [14:32:47] Got it! [14:33:18] Lucas_WMDE: done [14:33:39] !log UTC afternoon backport+config window done [14:33:41] thanks! \o/ [14:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:52] (I don’t see anything else to deploy in the calendar) [14:33:52] !log daniel@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [14:33:54] !log daniel@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [14:34:22] !log daniel@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [14:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:15] (03PS2) 10Muehlenhoff: conf/etcd: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1227307 (https://phabricator.wikimedia.org/T352245) [14:36:58] !log Run of script for T413868 has finished on s5 [14:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:10] !log Run of script for T413868 has finished on s6 [14:37:11] T413868: Populate the cu_useragent table and agent_id columns on WMF 
wikis - https://phabricator.wikimedia.org/T413868 [14:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:40] !log Run of script for T413868 has finished on s7 [14:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11541938 (10ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8... [14:40:44] !log daniel@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [14:40:45] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11541963 (10ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8... [14:40:48] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker[1008-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [14:41:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11541988 (10ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8... 
[14:41:14] !log daniel@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [14:42:49] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{dse-k8s-worker[1008-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [14:45:11] !log daniel@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [14:47:01] !log kamila@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [14:47:18] !log kamila@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [14:47:23] !log kamila@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [14:50:01] !log `[samtar@deploy2002 ~]$ mwscript sql.php --wiki=testwiki /srv/mediawiki/php-1.46.0-wmf.12/sql/mysql/patch-watchlist_label.sql` for T406843 [14:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:06] T406843: Create watchlist labels database tables - https://phabricator.wikimedia.org/T406843 [14:55:28] (03CR) 10FNegri: maintain-views: Show il_target_id references in linktarget (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [14:55:37] (03CR) 10FNegri: [C:03+2] maintain-views: Show il_target_id references in linktarget [puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [14:57:22] (03PS6) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) [14:57:31] !log kamila@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [14:58:15] !log kamila@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [14:58:26] (03CR) 10Daniel Kinzler: 
[C:03+2] rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [15:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1500) [15:00:34] (03Merged) 10jenkins-bot: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [15:00:51] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update check-wf-services.sh to also check the v2 endpoint for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229210 (https://phabricator.wikimedia.org/T414589) (owner: 10Jforrester) [15:01:36] !log kamila@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [15:02:39] !log kamila@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [15:02:46] (03Merged) 10jenkins-bot: wikifunctions: Update check-wf-services.sh to also check the v2 endpoint for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229210 (https://phabricator.wikimedia.org/T414589) (owner: 10Jforrester) [15:02:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227307 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [15:03:18] (03PS4) 10Daniel Kinzler: rest gateway: include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) [15:03:38] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-01-07-132938 to 2026-01-15-194836 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229568 (https://phabricator.wikimedia.org/T394557) (owner: 10Jforrester) [15:04:14] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: 
include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [15:05:34] (03CR) 10Muehlenhoff: "Thanks for that! Let's move this forward once all legacy keys are cleared out" [puppet] - 10https://gerrit.wikimedia.org/r/1229536 (owner: 10Majavah) [15:05:36] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-01-07-132938 to 2026-01-15-194836 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229568 (https://phabricator.wikimedia.org/T394557) (owner: 10Jforrester) [15:06:05] (03CR) 10FNegri: [C:03+2] "This change is now live on all clouddb* hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1229221 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [15:06:20] (03Merged) 10jenkins-bot: rest gateway: include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [15:06:55] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:07:27] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:07:28] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1008.eqiad.wmnet [15:07:39] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:07:52] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11542145 (10ops-monitoring-bot) Host dse-k8s-worker1008.eqiad.wmnet rebooted by btullis@cumin1003 with... 
[15:08:19] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:08:25] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:09:07] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:19] (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2026-01-07-163903 to 2026-01-21-135031 [deployment-charts] - https://gerrit.wikimedia.org/r/1229569 (https://phabricator.wikimedia.org/T413728) (owner: Jforrester)
[15:10:48] (CR) Elukey: [C:+1] Remove documentation about the legacy way to configure TLS for Kafka [puppet] - https://gerrit.wikimedia.org/r/1229573 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[15:10:49] pt1979@cumin1003 reimage (PID 2919783) is awaiting input
[15:11:22] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-01-07-163903 to 2026-01-21-135031 [deployment-charts] - https://gerrit.wikimedia.org/r/1229569 (https://phabricator.wikimedia.org/T413728) (owner: Jforrester)
[15:12:21] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:12:41] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:12:45] !log kamila@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[15:13:24] (CR) Muehlenhoff: [C:+2] conf/etcd: Remove now obsolete cert [puppet] - https://gerrit.wikimedia.org/r/1227307 (https://phabricator.wikimedia.org/T352245) (owner: Muehlenhoff)
[15:14:35] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:15:06] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:15:12] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:15:45] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:18:18] 10SRE-tools, 10DNS, 06Infrastructure-Foundations, 06Traffic, 13Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761#11542183 (10BCornwall) 05Stalled→03Resolved a:03BCornwall I believe this has been solved with the lates... [15:18:25] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:18:49] 06SRE, 10Prod-Kubernetes, 06ServiceOps new, 06SRE Observability (FY2025/2026-Q3): write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11542187 (10hnowlan) [15:19:08] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:20:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1008.eqiad.wmnet [15:20:09] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1008.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:23:07] (03PS3) 10Scott French: mobileapps: Use max-old-space-size instead [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [15:23:52] (03CR) 10Scott French: [C:03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [15:25:39] (03PS2) 10Muehlenhoff: conf/etcd: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1227309 (https://phabricator.wikimedia.org/T352245) [15:26:21] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker[1009-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [15:26:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11542216 (10ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8... [15:29:01] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Use max-old-space-size instead [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1500) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1530) [15:30:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227309 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [15:30:27] (03Merged) 10jenkins-bot: mobileapps: Use max-old-space-size instead [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229572 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [15:33:35] !log started `[samtar@deploy2002 ~]$ foreachwiki sql.php /srv/mediawiki/php-1.46.0-wmf.12/sql/mysql/patch-watchlist_label.sql` for T406843 [15:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:40] T406843: Create watchlist labels database tables - 
https://phabricator.wikimedia.org/T406843 [15:34:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:23] (03PS1) 10Kamila Součková: failoid-ng: add namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229588 [15:38:40] !log foreachwiki ... completed for T406843 [15:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:45] T406843: Create watchlist labels database tables - https://phabricator.wikimedia.org/T406843 [15:41:53] (03CR) 10Muehlenhoff: [C:03+2] conf/etcd: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1227309 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [15:42:14] (03PS1) 10Elukey: role::puppetserver: add the analytics-sre user key and configs [puppet] - 10https://gerrit.wikimedia.org/r/1229590 (https://phabricator.wikimedia.org/T402512) [15:43:14] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229590 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:44:19] (03CR) 10Filippo Giunchedi: [C:03+1] P:dumps::distribution: Only use IP addresses in rsync config [puppet] - 10https://gerrit.wikimedia.org/r/1229570 (owner: 10Majavah) [15:45:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11542351 (10MoritzMuehlenhoff) [15:47:02] (03PS1) 10Papaul: Troubleshooting install on mwlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1229592 (https://phabricator.wikimedia.org/T412230) [15:49:28] (03CR) 10Papaul: [C:03+2] Troubleshooting install on mwlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1229592 (https://phabricator.wikimedia.org/T412230) (owner: 10Papaul) [15:50:08] (03PS2) 10Elukey: role::puppetserver: add the 
analytics-sre user key and configs [puppet] - 10https://gerrit.wikimedia.org/r/1229590 (https://phabricator.wikimedia.org/T402512) [15:50:27] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229590 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:51:06] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution: Only use IP addresses in rsync config [puppet] - 10https://gerrit.wikimedia.org/r/1229570 (owner: 10Majavah) [15:53:59] (03PS2) 10Majavah: P:dumps::distribution: Manage firewall allowlists like rsync allowlists [puppet] - 10https://gerrit.wikimedia.org/r/1229574 [15:54:06] (03CR) 10Cathal Mooney: [C:03+2] plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1228518 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:54:42] (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7929/console" [puppet] - 10https://gerrit.wikimedia.org/r/1229159 (https://phabricator.wikimedia.org/T414986) (owner: 10FNegri) [15:54:57] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7928/co" [puppet] - 10https://gerrit.wikimedia.org/r/1229574 (owner: 10Majavah) [15:55:39] (03CR) 10Brouberol: [C:03+1] "LGTM thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1229573 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:56:14] (03PS3) 10Majavah: P:dumps::distribution: Manage firewall allowlists like rsync allowlists [puppet] - 10https://gerrit.wikimedia.org/r/1229574 [15:56:31] (03CR) 10Elukey: "Hey folks, this is the first bit of https://phabricator.wikimedia.org/T402512#11537796, using the analytics-sre user that we already have " [puppet] - 10https://gerrit.wikimedia.org/r/1229590 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:57:42] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Update Homer plugin to enable IPv6 peering to authdns boxes - cmooney@cumin1003 [15:58:44] (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1229574 (owner: 10Majavah) [15:59:19] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Update Homer plugin to enable IPv6 peering to authdns boxes - cmooney@cumin1003 [15:59:43] (03CR) 10Majavah: P:dumps::distribution: Manage firewall allowlists like rsync allowlists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229574 (owner: 10Majavah) [16:00:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1229574 (owner: 10Majavah) [16:00:57] (03CR) 10Majavah: [C:03+2] P:dumps::distribution: Manage firewall allowlists like rsync allowlists [puppet] - 10https://gerrit.wikimedia.org/r/1229574 (owner: 10Majavah) [16:01:57] !log installing docker.io updates from Bookworm point release [16:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:29] (03CR) 10FNegri: [V:03+1 C:03+2] cloudnfs: Add wikiqlever project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/1229159 (https://phabricator.wikimedia.org/T414986) (owner: 
10FNegri) [16:08:09] (03PS1) 10Majavah: dumps: rsync: Simplify configuration handling logic [puppet] - 10https://gerrit.wikimedia.org/r/1229600 [16:11:50] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542431 (10Jhancock.wm) [16:11:51] (03CR) 10Muehlenhoff: [C:03+2] Remove documentation about the legacy way to configure TLS for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1229573 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [16:17:17] !log start populating il_target_id on s1, s2, s4, s7 and s8 wikis # T413668 [16:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:22] T413668: Run the data migration of imagelinks - https://phabricator.wikimedia.org/T413668 [16:19:00] (03PS2) 10Majavah: dumps: rsync: Simplify configuration handling logic [puppet] - 10https://gerrit.wikimedia.org/r/1229600 [16:19:54] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7931/co" [puppet] - 10https://gerrit.wikimedia.org/r/1229600 (owner: 10Majavah) [16:21:43] !log pt1979@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mwlog1003.eqiad.wmnet with OS bookworm [16:22:00] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11542486 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookw... 
[16:22:05] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog1003.eqiad.wmnet with OS bookworm [16:22:19] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11542488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS b... [16:22:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1229590 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:23:22] Hi. The bots are returning errors on the following two commits, and it doesn’t seem to be related to me. Can someone help me with this? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaUploader/+/1225643 [16:23:22] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FacetedCategory/+/1225653 [16:24:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2354.codfw.wmnet with OS bookworm [16:25:10] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2355.codfw.wmnet with OS bookworm [16:25:11] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542498 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS book... [16:25:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2356.codfw.wmnet with OS bookworm [16:25:32] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2355.codfw.wmnet with OS book... 
[16:25:44] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542509 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2356.codfw.wmnet with OS book... [16:25:45] Neriah: looks like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaUploader/+/1221300 tried to fix those issues but got stuck (also, I don’t think this is the right channel for that question) [16:26:06] yeah, CI is working correctly, the issues are in those extensions [16:26:15] !log dancy@deploy2002 Installing scap version "4.234.0" for 2 host(s) [16:28:06] !log dancy@deploy2002 Installation of scap version "4.234.0" completed for 2 hosts [16:28:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:29:45] taavi and Lucas_WMDE, does this mean that changes to these extensions are blocked until someone addresses the issues? Or is it possible to skip the CI in certain cases? [16:30:30] Also, going forward, what is the appropriate channel for questions like this? [16:31:16] #mediawiki is probably a good place, I think [16:31:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:31:23] (03PS1) 10Zabe: Start reading from il_target_id on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229607 (https://phabricator.wikimedia.org/T413669) [16:31:39] and it’s technically possible to bypass CI but IMHO it wouldn’t be warranted here. 
those issues should be fixed [16:36:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [16:36:32] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage [16:37:44] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:38:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11542533 (10jhathaway) a:03jhathaway @MatthewVernon looking... [16:40:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [16:40:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11542548 (10KReid-WMF) Hi - I've checked and I'm able to log in and see the test kitchen staging environment. Thanks! 
[16:40:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:43:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87833 and previous config saved to /var/cache/conftool/dbconfig/20260121-164311-marostegui.json [16:43:19] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:43:20] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:44:08] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage [16:46:20] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11542556 (10MoritzMuehlenhoff) [16:51:27] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2354.codfw.wmnet with OS bookworm [16:51:46] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542591 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS bookworm... 
[16:53:15] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2354.codfw.wmnet with OS bookworm [16:53:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P87834 and previous config saved to /var/cache/conftool/dbconfig/20260121-165319-marostegui.json [16:53:31] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542594 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS book... [16:53:52] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2074.codfw.wmnet with OS bullseye [16:53:55] !log jhathaway@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2074 [16:53:56] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2074 [16:59:29] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:59:43] !log pt1979@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mwlog1003.eqiad.wmnet with OS bookworm [16:59:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11542600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with erro... 
[17:01:53] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:03:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P87835 and previous config saved to /var/cache/conftool/dbconfig/20260121-170328-marostegui.json [17:04:57] jhancock@cumin1003 reimage (PID 2934440) is awaiting input [17:05:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [17:06:18] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:06:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2355.codfw.wmnet with OS bookworm [17:06:36] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542645 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2355.codfw.wmnet with OS bookworm... 
[17:06:55] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update IP for mwlog1003 - pt1979@cumin2002" [17:07:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update IP for mwlog1003 - pt1979@cumin2002" [17:07:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:08] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [17:08:20] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog1003.eqiad.wmnet with OS bookworm [17:08:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11542650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm [17:09:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [17:09:51] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:11:56] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2074.codfw.wmnet with reason: host reimage [17:13:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87836 and previous config saved to /var/cache/conftool/dbconfig/20260121-171336-marostegui.json [17:13:45] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:13:46] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:13:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1235.eqiad.wmnet with reason: 
Maintenance [17:14:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87837 and previous config saved to /var/cache/conftool/dbconfig/20260121-171401-marostegui.json [17:16:00] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2074.codfw.wmnet with reason: host reimage [17:22:16] FIRING: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:27:28] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:30:33] jhancock@cumin1003 reimage (PID 2939107) is awaiting input [17:34:56] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2074.codfw.wmnet with OS bullseye [17:37:52] !log sudo cumin "A:dnsbox" "disable-puppet 'merging CR 1226928'": T81605 [17:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:57] T81605: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 [17:38:21] (03PS9) 10Ssingh: dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) [17:38:47] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2076.codfw.wmnet with OS bullseye [17:38:50] !log jhathaway@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2076 [17:38:51] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2076 [17:39:57] (03CR) 10Ssingh: [V:03+2 C:03+2] dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 
10Ssingh) [17:41:32] (03CR) 10Jasmine: wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) (owner: 10Jasmine) [17:43:17] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: testing authdns IPv6 change] [17:45:41] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2356.codfw.wmnet with OS bookworm [17:46:02] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542754 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2356.codfw.wmnet with OS bookworm... [17:46:37] PROBLEM - Host 2a02:ec80:700:1:195:200:68:4 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:11] RESOLVED: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:47:45] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:45] PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:45] PROBLEM - Host ns2-v6 is DOWN: PING CRITICAL - Packet loss = 100% [17:48:17] PROBLEM - Bird Internet Routing Daemon on dns7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:48:17] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:48:23] ^ yeah that's OK, there isn't really any -v6 [17:48:26] we are aware, see -sre [17:53:17] RECOVERY - Bird Internet Routing Daemon on dns7001 is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:53:17] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:56:39] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2076.codfw.wmnet with reason: host reimage [17:59:42] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2076.codfw.wmnet with reason: host reimage [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1800) [18:02:51] SREs: Does anyone plan to use this deployment window? ^^^ [18:02:57] If not we'd like to roll back the train [18:03:09] * swfrench-wmf thumbs up [18:03:21] nothing on my end, as the most frequent user of this window [18:03:34] alright let me roll back then [18:03:43] thanks, andre! [18:03:49] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229622 (https://phabricator.wikimedia.org/T413803) [18:03:52] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229622 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [18:04:45] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229622 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [18:06:53] pt1979@cumin1003 reimage (PID 2942014) is awaiting input [18:09:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:09:22] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2354.codfw.wmnet with OS bookworm [18:09:42] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps 
new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11542797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS bookworm... [18:11:02] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.12 refs T413803 [18:11:07] T413803: 1.46.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T413803 [18:11:22] Train rollback finished. [18:11:37] * swfrench-wmf thumbs up [18:11:54] <3 [18:13:10] yes, this has silenced the issue per https://logstash.wikimedia.org/goto/9eeed598e9b79b9ade28de309e87bd06 [18:14:43] (03PS1) 10Majavah: interface::ip: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1229624 [18:14:43] (03PS1) 10Majavah: interface::ip: Fix default prefix length for IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1229625 [18:16:24] (03PS2) 10Majavah: interface::ip: Fix default prefix length for IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1229625 [18:17:07] (03PS1) 10Ssingh: Revert "dnsbox: advertise ns[0-2] IPv6" [puppet] - 10https://gerrit.wikimedia.org/r/1229628 [18:17:37] (03PS1) 10Cathal Mooney: Revert "plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1229629 [18:18:35] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2076.codfw.wmnet with OS bullseye [18:18:43] (03PS2) 10Majavah: interface::ip: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1229624 [18:18:43] (03PS3) 10Majavah: interface::ip: Fix default prefix length for IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1229625 [18:19:17] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:19:24] (03CR) 10Cathal Mooney: [C:03+2] Revert "plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP" [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1229629 (owner: 10Cathal Mooney) [18:19:30] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7934/co" [puppet] - 10https://gerrit.wikimedia.org/r/1229625 (owner: 10Majavah) [18:20:27] (03CR) 10Ssingh: [C:03+2] Revert "dnsbox: advertise ns[0-2] IPv6" [puppet] - 10https://gerrit.wikimedia.org/r/1229628 (owner: 10Ssingh) [18:20:57] (03CR) 10CI reject: [V:04-1] interface::ip: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1229624 (owner: 10Majavah) [18:21:33] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Update Homer plugin to enable IPv6 peering to authdns boxes - cmooney@cumin1003 [18:22:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:22:32] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [18:22:35] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:23:07] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [18:23:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Update Homer plugin to enable IPv6 peering to authdns boxes - cmooney@cumin1003 [18:23:49] (03PS3) 10Majavah: interface::ip: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1229624 [18:23:50] (03PS4) 10Majavah: interface::ip: Fix default prefix length for IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1229625 [18:23:50] (03PS1) 10Majavah: hieradata: cloudceph: Set prefix length as an integer [puppet] - 10https://gerrit.wikimedia.org/r/1229633 [18:23:50] (03PS1) 10Majavah: interface::ip: Use consistent indentation [puppet] - 10https://gerrit.wikimedia.org/r/1229634 [18:24:40] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7935/console" [puppet] - 10https://gerrit.wikimedia.org/r/1229634 (owner: 10Majavah) [18:28:26] jhancock@cumin1003 netbox (PID 2952049) is awaiting input [18:30:57] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance [18:31:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2207 (T410589)', diff saved to https://phabricator.wikimedia.org/P87838 and previous config saved to /var/cache/conftool/dbconfig/20260121-183104-ladsgroup.json [18:31:10] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [18:31:32] andre: Are you done with your train rollback? I'd like to deploy a private change [18:31:42] RoanKattouw, yes, done [18:31:47] go ahead :)( [18:33:44] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: corrections to codfw - jhancock@cumin1003" [18:34:15] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [18:34:23] PROBLEM - Ensure traffic_manager is running for instance backend on cp7015 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:34:32] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: corrections to codfw - jhancock@cumin1003" [18:34:32] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:34:34] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [18:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:23] RECOVERY - Ensure traffic_manager is running for instance backend on cp7015 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:35:35] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:39:18] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: corrections to codfw - jhancock@cumin1003" [18:39:22] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: corrections to codfw - jhancock@cumin1003" [18:39:22] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:15] (03Abandoned) 10C. Scott Ananian: Allow defaulting to Parsoid Read Views when MobileFrontEnd is active [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098549 (https://phabricator.wikimedia.org/T381002) (owner: 10C. Scott Ananian) [18:44:17] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:45:28] (03Abandoned) 10C. Scott Ananian: Enable testing LanguageConverter in sandboxes on deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628) (owner: 10C. 
Scott Ananian) [18:48:50] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: corrections to codfw - jhancock@cumin1003" [18:48:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: corrections to codfw - jhancock@cumin1003" [18:48:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:50:03] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11542936 (10FCeratto-WMF) 05In progress→03Resolved Thanks, closing task. [18:51:32] (03PS2) 10C. Scott Ananian: Turn on Parsoid read views by default on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999060 (https://phabricator.wikimedia.org/T357054) [18:51:53] (03CR) 10C. Scott Ananian: "We have not yet enabled Parsoid Read Views on beta, but hopefully will do so this quarter!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999060 (https://phabricator.wikimedia.org/T357054) (owner: 10C. 
Scott Ananian) [18:55:09] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:59:29] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2354 [18:59:39] (03PS1) 10Daniel Kinzler: redioscope: fix survey generation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229648 [18:59:39] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2354 [18:59:42] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2355 [18:59:49] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2355 [18:59:55] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2356 [19:00:01] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2356 [19:00:04] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host mwlog2003 [19:00:05] andre and jeena: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T1900). 
[19:00:11] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mwlog2003 [19:00:11] jouncebot: ♫ No train today, my transcodes gone away / The smoke stands for lorn, a symbol of the dawn ♫ [19:00:26] 😆 [19:00:39] Nice [19:01:47] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2354.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:02:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2354.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:02:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2355.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:03:12] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2356.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:03:14] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2355.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:03:17] RECOVERY - Host 2a02:ec80:700:1:195:200:68:4 is UP: PING OK - Packet loss = 0%, RTA = 112.60 ms [19:07:37] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2356.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:07:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host dns7001.wikimedia.org [19:10:15] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:11:12] ^ reboot [19:12:54] (03CR) 10Kamila Součková: [C:03+1] redioscope: fix survey generation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229648 (owner: 10Daniel Kinzler) [19:14:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - 
https://phabricator.wikimedia.org/T415189#11543062 (10jhathaway) Strangely I re-imaged both servers from `cumin2002` and ran into no issues. Perhaps when you ran the first... [19:16:20] (03CR) 10Cathal Mooney: [C:03+1] "Overall lgtm thanks! Let's see if Arzhel can weigh in on Friday though just in case there is something I am missing." [puppet] - 10https://gerrit.wikimedia.org/r/1229624 (owner: 10Majavah) [19:17:15] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:21:33] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns7001.wikimedia.org [19:26:57] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: END testing authdns IPv6 change] [19:27:06] !log sukhe@dns1004 START - running authdns-update [19:28:11] !log sukhe@dns1004 END - running authdns-update [19:30:02] !log re-enable puppet on A:dnsbox: T81605 [19:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:08] T81605: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 [19:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:49:13] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1017.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:53:05] (03PS1) 10Andrew Bogott: Cloudcontrol200[56]-dev: remove obsolete rabbit host name [puppet] - 10https://gerrit.wikimedia.org/r/1229666 [19:53:05] (03PS1) 10Andrew Bogott: Initial files for Openstack version flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1229667 (https://phabricator.wikimedia.org/T406516) [19:53:07] (03PS1) 10Andrew Bogott: cloudcontrol2005-dev: move to openstack version 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1229668 (https://phabricator.wikimedia.org/T406516) [19:53:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229666 (owner: 10Andrew Bogott) [19:54:28] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#11543186 (10Scott_French) [19:56:11] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229666 (owner: 10Andrew Bogott) [19:59:13] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1017.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:59:24] (03PS1) 10JHathaway: postfix: set myorigin to mydomain [puppet] - 10https://gerrit.wikimedia.org/r/1229670 (https://phabricator.wikimedia.org/T404884) [19:59:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87843 and previous config saved to /var/cache/conftool/dbconfig/20260121-195927-marostegui.json [19:59:34] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:59:34] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:59:39] (03CR) 10JHathaway: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229670 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [20:04:52] (03CR) 10JHathaway: [C:03+2] postfix: set myorigin to mydomain [puppet] - 10https://gerrit.wikimedia.org/r/1229670 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [20:06:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{dse-k8s-worker[1009-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [20:09:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P87844 and previous config saved to /var/cache/conftool/dbconfig/20260121-200935-marostegui.json [20:14:37] (03CR) 10Andrew Bogott: [C:03+2] Cloudcontrol200[56]-dev: remove obsolete rabbit host name [puppet] - 10https://gerrit.wikimedia.org/r/1229666 (owner: 10Andrew Bogott) [20:14:40] (03CR) 10Andrew Bogott: [C:03+2] Initial files for Openstack version flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1229667 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [20:14:42] (03CR) 10Andrew Bogott: [C:03+2] cloudcontrol2005-dev: move to openstack version 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1229668 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [20:19:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P87845 and previous config saved to /var/cache/conftool/dbconfig/20260121-201943-marostegui.json [20:29:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87846 and previous config saved to /var/cache/conftool/dbconfig/20260121-202951-marostegui.json [20:29:58] T411163: Drop ar_sha1 from archive table in wmf production - 
https://phabricator.wikimedia.org/T411163 [20:29:59] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:40:24] (03PS1) 10Catrope: Don't directly serialize PublicKeyCredential objects [extensions/WebAuthn] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229679 [20:40:37] (03PS1) 10Catrope: Don't directly serialize PublicKeyCredential objects [extensions/WebAuthn] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1229680 [20:41:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WebAuthn] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1229680 (owner: 10Catrope) [20:41:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WebAuthn] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229679 (owner: 10Catrope) [20:59:31] (03PS1) 10Andrew Bogott: Revert "cloudcontrol2005-dev: move to openstack version 'flamingo'" [puppet] - 10https://gerrit.wikimedia.org/r/1229682 [20:59:33] (03PS1) 10Andrew Bogott: Revert "Initial files for Openstack version flamingo" [puppet] - 10https://gerrit.wikimedia.org/r/1229683 [20:59:36] (03PS1) 10Andrew Bogott: Revert "Cloudcontrol200[56]-dev: remove obsolete rabbit host name" [puppet] - 10https://gerrit.wikimedia.org/r/1229684 [20:59:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229682 (owner: 10Andrew Bogott) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T2100). 
[21:00:05] RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:04] I'll deploy my own patches [21:02:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229219 (https://phabricator.wikimedia.org/T415146) (owner: 10Catrope) [21:02:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WebAuthn] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1229680 (owner: 10Catrope) [21:02:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WebAuthn] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229679 (owner: 10Catrope) [21:03:30] (03Merged) 10jenkins-bot: Enable OATHAuth passkey features in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229219 (https://phabricator.wikimedia.org/T415146) (owner: 10Catrope) [21:03:49] (03Merged) 10jenkins-bot: Don't directly serialize PublicKeyCredential objects [extensions/WebAuthn] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1229680 (owner: 10Catrope) [21:03:49] (03Merged) 10jenkins-bot: Don't directly serialize PublicKeyCredential objects [extensions/WebAuthn] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229679 (owner: 10Catrope) [21:04:28] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1229219|Enable OATHAuth passkey features in production (T415146)]], [[gerrit:1229680|Don't directly serialize PublicKeyCredential objects]], [[gerrit:1229679|Don't directly serialize PublicKeyCredential objects]] [21:04:32] T415146: Enable passkeys in production - https://phabricator.wikimedia.org/T415146 [21:06:36] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229682 (owner: 10Andrew Bogott) 
[21:06:42] !log catrope@deploy2002 catrope: Backport for [[gerrit:1229219|Enable OATHAuth passkey features in production (T415146)]], [[gerrit:1229680|Don't directly serialize PublicKeyCredential objects]], [[gerrit:1229679|Don't directly serialize PublicKeyCredential objects]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:10:00] !log catrope@deploy2002 catrope: Continuing with sync [21:14:10] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229219|Enable OATHAuth passkey features in production (T415146)]], [[gerrit:1229680|Don't directly serialize PublicKeyCredential objects]], [[gerrit:1229679|Don't directly serialize PublicKeyCredential objects]] (duration: 09m 42s) [21:14:15] T415146: Enable passkeys in production - https://phabricator.wikimedia.org/T415146 [21:21:42] (03Abandoned) 10Andrew Bogott: Revert "cloudcontrol2005-dev: move to openstack version 'flamingo'" [puppet] - 10https://gerrit.wikimedia.org/r/1229682 (owner: 10Andrew Bogott) [21:23:20] (03PS1) 10Andrew Bogott: cloudcontrol2005-dev: move to openstack version 'flamingo' for real [puppet] - 10https://gerrit.wikimedia.org/r/1229695 (https://phabricator.wikimedia.org/T406516) [21:24:39] (03CR) 10Andrew Bogott: [C:03+2] cloudcontrol2005-dev: move to openstack version 'flamingo' for real [puppet] - 10https://gerrit.wikimedia.org/r/1229695 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:48:26] (03PS1) 10Andrew Bogott: openstack nova: update our server name regex hack for Flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1229715 (https://phabricator.wikimedia.org/T406516) [21:56:40] (03PS1) 10Andrew Bogott: Openstack keystone: remove a patched-in fix for Keystone [puppet] - 10https://gerrit.wikimedia.org/r/1229718 (https://phabricator.wikimedia.org/T406516) [21:56:43] (03PS1) 10Andrew Bogott: Openstack Keystone: remove a hack about project_id validation [puppet] - 
10https://gerrit.wikimedia.org/r/1229719 (https://phabricator.wikimedia.org/T406516) [21:56:49] (03CR) 10Andrew Bogott: [C:03+2] openstack nova: update our server name regex hack for Flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1229715 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:59:20] (03CR) 10Andrew Bogott: [C:03+2] Openstack keystone: remove a patched-in fix for Keystone [puppet] - 10https://gerrit.wikimedia.org/r/1229718 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:59:23] (03CR) 10Andrew Bogott: [C:03+2] Openstack Keystone: remove a hack about project_id validation [puppet] - 10https://gerrit.wikimedia.org/r/1229719 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T2200) [22:06:37] (03PS1) 10Jforrester: Revert "Fix DivisionByZeroError when calculating bitrate" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229724 (https://phabricator.wikimedia.org/T415169) [22:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:47:14] (03CR) 10Aaron Schulz: [C:03+1] Revert "Fix DivisionByZeroError when calculating bitrate" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229724 (https://phabricator.wikimedia.org/T415169) (owner: 10Jforrester) [22:53:15] PROBLEM - snapshot of s2 in codfw on backupmon1001 is CRITICAL: Last snapshot for s2 at codfw (db2197) taken on 2026-01-21 22:06:08 is 649 GiB, but the previous one was 836 GiB, a change of -22.4 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [22:56:15] (03CR) 10Aaron Schulz: [C:03+1] 
"Looks good to unblock the train." [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1229724 (https://phabricator.wikimedia.org/T415169) (owner: 10Jforrester) [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260121T2300) [23:08:23] (03CR) 10Cwhite: [C:03+2] logstash: remove logstash-ml indices after 70d [puppet] - 10https://gerrit.wikimedia.org/r/1229198 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [23:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:22:45] (03CR) 10Arlolra: [C:03+1] Turn on Parsoid read views by default on labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999060 (https://phabricator.wikimedia.org/T357054) (owner: 10C. 
Scott Ananian) [23:26:17] (03PS1) 10Andrew Bogott: openstack: move more of codfw1dev to openstack 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1229780 [23:28:15] (03CR) 10Andrew Bogott: [C:03+2] openstack: move more of codfw1dev to openstack 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1229780 (owner: 10Andrew Bogott) [23:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown