[00:00:47] (Merged) jenkins-bot: Start reading from il_target_id on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1226965 (https://phabricator.wikimedia.org/T413669) (owner: Zabe)
[00:01:23] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226965|Start reading from il_target_id on testwiki (T413669)]]
[00:01:29] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669
[00:03:30] !log zabe@deploy2002 zabe: Backport for [[gerrit:1226965|Start reading from il_target_id on testwiki (T413669)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:05:30] !log zabe@deploy2002 zabe: Continuing with sync
[00:09:36] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226965|Start reading from il_target_id on testwiki (T413669)]] (duration: 08m 13s)
[00:09:41] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669
[00:14:45] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523618 (Papaul) Phase 1 of ULSFO migration which was changing the loopback addresses of cr1, cr4, mr1 and the IP address of the link between cr3 and cr4 was...
[00:23:57] PROBLEM - Host an-worker1159 is DOWN: PING CRITICAL - Packet loss = 100%
[00:23:57] PROBLEM - Host an-worker1160 is DOWN: PING CRITICAL - Packet loss = 100%
[00:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:41:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1226973
[00:41:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1226973 (owner: TrainBranchBot)
[00:50:20] (PS1) Sbisson: CX3 Build 1.0.0+20260114 [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226976 (https://phabricator.wikimedia.org/T413646)
[00:50:43] (PS1) Sbisson: Fallback to source title if target title is not provided by cxserver [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226977 (https://phabricator.wikimedia.org/T414558)
[00:51:41] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226976 (https://phabricator.wikimedia.org/T413646) (owner: Sbisson)
[00:52:08] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226977 (https://phabricator.wikimedia.org/T414558) (owner: Sbisson)
[00:54:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1226973 (owner: TrainBranchBot)
[00:56:59] ryankemper@cumin2002 reboot-workers (PID 2845277) is awaiting input
[00:57:44] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[01:00:49] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:46] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1226980
[01:10:47] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1226980 (owner: TrainBranchBot)
[01:13:47] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 57s)
[01:18:39] (PS1) Jdrewniak: Update portals submodule for WP25 birthday preview. [mediawiki-config] - https://gerrit.wikimedia.org/r/1226981 (https://phabricator.wikimedia.org/T128546)
[01:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:27:28] (Abandoned) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1226477 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[01:33:20] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1226980 (owner: TrainBranchBot)
[01:41:45] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb6_29418 has 2 unhealthy realservers pooled on lvs7001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[02:38:06] FIRING: 
CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:40:12] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523794 (Papaul)
[04:00:37] (PS1) Clare Ming: Enable Test Kitchen on all prod wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1227004 (https://phabricator.wikimedia.org/T407806)
[04:02:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T413525)', diff saved to https://phabricator.wikimedia.org/P87525 and previous config saved to /var/cache/conftool/dbconfig/20260115-040216-marostegui.json
[04:02:22] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[04:06:59] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (dbprov1004), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:12:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P87526 and previous config saved to /var/cache/conftool/dbconfig/20260115-041225-marostegui.json
[04:22:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P87527 and previous config saved to /var/cache/conftool/dbconfig/20260115-042233-marostegui.json
[04:28:45] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523834 (Papaul)
[04:32:43] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2172 (T413525)', diff saved to https://phabricator.wikimedia.org/P87528 and previous config saved to /var/cache/conftool/dbconfig/20260115-043242-marostegui.json
[04:32:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[04:33:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
[04:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87529 and previous config saved to /var/cache/conftool/dbconfig/20260115-050448-marostegui.json
[05:04:55] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:04:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P87530 and previous config saved to /var/cache/conftool/dbconfig/20260115-051455-marostegui.json
[05:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:25:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P87532 and previous config saved to /var/cache/conftool/dbconfig/20260115-052504-marostegui.json
[05:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:09] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523872 (Papaul)
[05:35:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87533 and previous config saved to /var/cache/conftool/dbconfig/20260115-053512-marostegui.json
[05:35:19] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:35:19] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:35:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1262.eqiad.wmnet with reason: Maintenance
[05:35:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87534 and previous config saved to /var/cache/conftool/dbconfig/20260115-053537-marostegui.json
[06:28:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[06:29:03] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Depooling db1169 (T413525)', diff saved to https://phabricator.wikimedia.org/P87535 and previous config saved to /var/cache/conftool/dbconfig/20260115-062902-marostegui.json
[06:29:07] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[06:30:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T413525)', diff saved to https://phabricator.wikimedia.org/P87536 and previous config saved to /var/cache/conftool/dbconfig/20260115-063011-marostegui.json
[06:32:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:33:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1169 gradually with 4 steps - After schema change
[06:35:25] (CR) Marostegui: [C:+1] sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[06:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:43:12] (PS1) Giuseppe Lavagetto: cache::upload: rate-limit rather than blocking bingbot [puppet] - https://gerrit.wikimedia.org/r/1227202
[06:45:13] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11523917 (Dzahn) a:ATitkov→Dzahn - site updated to version: 2026-01-14-150341 https://gerrit.wikimedia.org/r/c/operations/deploymen... 
[06:46:01] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11523919 (Dzahn) Open→In progress
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T0700)
[07:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T0700)
[07:01:17] !log restart snmp and MIB processes on asw1-b12-drmrs - T413181
[07:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:21] T413181: asw1-b12-drmrs stopped reporting metrics - https://phabricator.wikimedia.org/T413181
[07:02:46] (CR) Dzahn: [C:+2] Revert "trafficserver: disable wikipedia25" [puppet] - https://gerrit.wikimedia.org/r/1224959 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[07:03:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1169 gradually with 4 steps - After schema change
[07:06:43] (PS1) Marostegui: dbproxy2005: Add Debian Trixie note [puppet] - https://gerrit.wikimedia.org/r/1227204 (https://phabricator.wikimedia.org/T409398)
[07:08:55] (CR) Marostegui: [C:+2] dbproxy2005: Add Debian Trixie note [puppet] - https://gerrit.wikimedia.org/r/1227204 (https://phabricator.wikimedia.org/T409398) (owner: Marostegui)
[07:16:14] (CR) JMeybohm: [C:+1] "sgtm" [puppet] - https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[07:18:13] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11523949 (Dzahn) The site is active: https://www.wikipedia25.org
[07:25:26] (PS1) Superpes15: [slwiki] Fix temporary logo for Wikipedia 25 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1227210 (https://phabricator.wikimedia.org/T414265)
[07:28:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[07:33:14] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11523959 (A_smart_kitten) Just a note (apologies if there's a better place to raise this): When I click on any of the 'Transcript' buttons...
[07:51:59] (PS1) Muehlenhoff: Record LDAP access for tadeleye [puppet] - https://gerrit.wikimedia.org/r/1227214
[07:53:48] (CR) Muehlenhoff: [C:+2] Record LDAP access for tadeleye [puppet] - https://gerrit.wikimedia.org/r/1227214 (owner: Muehlenhoff)
[07:54:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
[07:54:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T413525)', diff saved to https://phabricator.wikimedia.org/P87540 and previous config saved to /var/cache/conftool/dbconfig/20260115-075444-marostegui.json
[07:54:49] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[07:54:56] (PS2) Gergő Tisza: debug: Add X-Provenance header to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396)
[07:55:05] (CR) CI reject: [V:-1] debug: Add X-Provenance header to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: Gergő Tisza)
[07:55:42] (PS3) Gergő Tisza: debug: Add X-Provenance header to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396)
[07:55:51] (CR) CI reject: [V:-1] debug: Add 
X-Provenance header to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: Gergő Tisza)
[07:55:54] (PS4) Gergő Tisza: debug: Add X-Provenance header to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396)
[08:00:52] good morning
[08:01:37] Superpes: hello, I'll deploy your change
[08:01:47] Hi thanks hashar :)
[08:01:57] artemkloko: good morning, I am going to deploy the WP25 change for portals
[08:02:22] * hashar reads the changes
[08:03:32] SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23), Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11523973 (RKemper) Got about 40 `an-worker*` hosts done, but there's still another ~80 left to be done
[08:04:43] I'll start
[08:05:46] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227210 (https://phabricator.wikimedia.org/T414265) (owner: Superpes15)
[08:05:46] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226981 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[08:06:32] changes are in the pipe https://integration.wikimedia.org/zuul/#q=mediawiki-config
[08:06:38] (Merged) jenkins-bot: [slwiki] Fix temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1227210 (https://phabricator.wikimedia.org/T414265) (owner: Superpes15)
[08:06:42] (Merged) jenkins-bot: Update portals submodule for WP25 birthday preview. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1226981 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[08:07:52] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1227210|[slwiki] Fix temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226981|Update portals submodule for WP25 birthday preview. (T128546)]]
[08:07:57] T414265: Requesting temporary logo change for sl.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414265
[08:07:57] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[08:10:18] !log hashar@deploy2002 hashar, jdrewniak, superpes: Backport for [[gerrit:1227210|[slwiki] Fix temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226981|Update portals submodule for WP25 birthday preview. (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:10:24] Testing!
[08:12:55] Uhm... looks weird! hashar Are you able to quickly test via browser?
[08:13:12] Oh now it looks fine lmao
[08:13:18] Maybe a cache issue?
[08:13:18] caches!! 
:b
[08:13:38] Yep lol It's fine thanks :)
[08:13:47] of course I have a wrong link
[08:13:48] :b
[08:14:09] artemkloko: I have pushed the change for the portal and the orange button points to a link that does not exist :/
[08:14:40] I guess cause the wikimediafoundation.org page has not been published
[08:14:43] (PS4) Dreamy Jazz: Write new for CheckUser user agent table migration on group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1223674 (https://phabricator.wikimedia.org/T361196)
[08:14:44] (PS4) Dreamy Jazz: Write new for CheckUser user agent table migration everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1223675 (https://phabricator.wikimedia.org/T361196)
[08:15:21] Superpes: great thanks
[08:15:37] I'll most probably cancel, revert the portals update change and deploy again
[08:16:22] !log hashar@deploy2002 Sync cancelled.
[08:17:58] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11523999 (Dzahn) @A_smart_kitten Thanks for reporting. The issue is known and currently a fix is being worked on.
[08:18:44] (PS1) Hashar: Revert "Update portals submodule for WP25 birthday preview." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1227256 (https://phabricator.wikimedia.org/T128546)
[08:19:24] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1223674 (https://phabricator.wikimedia.org/T361196) (owner: Dreamy Jazz)
[08:20:04] jouncebot: nowandnext
[08:20:05] For the next 0 hour(s) and 39 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T0800)
[08:20:05] In 2 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1100)
[08:20:30] I've stopped the +2, waiting for others to finish their changes
[08:21:00] hashar: Could you ping me when you are done?
[08:21:35] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227256 (https://phabricator.wikimedia.org/T128546) (owner: Hashar)
[08:21:37] Dreamy_Jazz: sure!
[08:22:23] (Merged) jenkins-bot: Revert "Update portals submodule for WP25 birthday preview." [mediawiki-config] - https://gerrit.wikimedia.org/r/1227256 (https://phabricator.wikimedia.org/T128546) (owner: Hashar)
[08:22:54] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1227256|Revert "Update portals submodule for WP25 birthday preview." 
(T128546 T414533)]]
[08:23:00] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[08:23:00] T414533: Update the url of the CTA button for Wikipedia25 portal customisation - https://phabricator.wikimedia.org/T414533
[08:23:43] (PS1) Hashar: Update portals submodule for WP25 birthday preview [2] [mediawiki-config] - https://gerrit.wikimedia.org/r/1227258 (https://phabricator.wikimedia.org/T128546)
[08:25:15] !log hashar@deploy2002 hashar: Backport for [[gerrit:1227256|Revert "Update portals submodule for WP25 birthday preview." (T128546 T414533)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:25:47] SRE-Sprint-Week-Sustainability-March2023, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530#11524041 (Marostegui) Open→Resolved a:Ladsgroup I think we can consider this done. @Ladsgroup has done lots of work to 1) re...
[08:25:52] !log hashar@deploy2002 hashar: Continuing with sync
[08:28:28] Dreamy_Jazz: my changes are syncing
[08:29:21] Thanks
[08:29:52] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11524049 (ABran-WMF)
[08:29:58] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227256|Revert "Update portals submodule for WP25 birthday preview." 
(T128546 T414533)]] (duration: 07m 04s)
[08:30:03] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[08:30:04] T414533: Update the url of the CTA button for Wikipedia25 portal customisation - https://phabricator.wikimedia.org/T414533
[08:30:56] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1223674 (https://phabricator.wikimedia.org/T361196) (owner: Dreamy Jazz)
[08:31:46] (CR) Vgutierrez: [C:+1] "VTCs are happy and condition properly matches the intended traffic" [puppet] - https://gerrit.wikimedia.org/r/1227202 (owner: Giuseppe Lavagetto)
[08:31:50] (Merged) jenkins-bot: Write new for CheckUser user agent table migration on group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1223674 (https://phabricator.wikimedia.org/T361196) (owner: Dreamy Jazz)
[08:32:21] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1223674|Write new for CheckUser user agent table migration on group1 (T361196)]]
[08:32:25] T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196
[08:32:53] still running
[08:32:56] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11524061 (Dzahn) Unrelated to the issue reported above, but for the record. We had an initial problem with the bare domain without www be... 
[08:34:27] jouncebot: next
[08:34:27] In 2 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1100)
[08:34:32] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1223674|Write new for CheckUser user agent table migration on group1 (T361196)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:36:17] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[08:36:40] I hadn't finished testing?
[08:36:48] I did
[08:36:56] Okay
[08:36:57] I pushed a rollback :b
[08:37:56] Ah, okay
[08:38:07] pff
[08:38:12] of course the page has been published now
[08:38:17] (PS1) Dzahn: miscweb: update wikipedia25 image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1227260 (https://phabricator.wikimedia.org/T408592)
[08:38:19] hashar, Dreamy_Jazz: I'd like to enable the TestKitchen extension everywhere. It looks like we've got a lot of time after the window. If not, I can do it in the afternoon window
[08:38:20] so I gotta deploy again
[08:38:30] :D
[08:38:32] Or maybe not :D :D :D
[08:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:09] phuedx: has that TestKitchen extension been fixed? 
It overlapped/clashed with MetricsPlatform :b
[08:39:25] (CR) Dzahn: [C:+2] miscweb: update wikipedia25 image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1227260 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[08:39:48] (I suspect the code got copy pasted between repos losing the history but I digress)
[08:39:51] anyway yea
[08:39:58] but I have to push again that portals update change
[08:40:15] same here with updating the birthday page.. in progress
[08:40:15] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223674|Write new for CheckUser user agent table migration on group1 (T361196)]] (duration: 07m 54s)
[08:40:19] T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196
[08:40:39] hashar: Yes. It's currently enabled on testwiki. I believe the CI issues have been fixed
[08:41:25] (Merged) jenkins-bot: miscweb: update wikipedia25 image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1227260 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[08:41:46] phuedx: great :]
[08:41:51] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227258 (https://phabricator.wikimedia.org/T128546) (owner: Hashar)
[08:42:41] !log dzahn@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[08:42:44] (Merged) jenkins-bot: Update portals submodule for WP25 birthday preview [2] [mediawiki-config] - https://gerrit.wikimedia.org/r/1227258 (https://phabricator.wikimedia.org/T128546) (owner: Hashar)
[08:43:02] !log dzahn@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[08:43:15] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1227258|Update portals submodule for WP25 birthday preview [2] (T128546 T414533)]]
[08:43:21] T128546: [Recurring Task] Update 
Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[08:43:21] !log dzahn@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[08:43:21] T414533: Update the url of the CTA button for Wikipedia25 portal customisation - https://phabricator.wikimedia.org/T414533
[08:43:40] !log dzahn@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[08:44:06] !log dzahn@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[08:44:30] !log dzahn@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[08:44:45] (CR) Muehlenhoff: [C:+2] Remove profile::puppet::agent::force_puppet7 from observability roles [puppet] - https://gerrit.wikimedia.org/r/1226178 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:45:30] !log hashar@deploy2002 hashar: Backport for [[gerrit:1227258|Update portals submodule for WP25 birthday preview [2] (T128546 T414533)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:45:52] !log hashar@deploy2002 hashar: Continuing with sync
[08:46:05] ah this time the link worked
[08:46:29] so that is poor synchronization with me deploying the www.wikipedia.org update before the target page got published by comm
[08:46:31] fun times
[08:46:48] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11524096 (Dzahn) deployed latest version 2026-01-15-080024 - @A_smart_kitten is it gone for you too?
[08:48:11] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11524099 (A_smart_kitten) @dzahn checking just now on the device I used before, the 'Not Found' page was initially cached, but once I refre... 
[08:49:57] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227258|Update portals submodule for WP25 birthday preview [2] (T128546 T414533)]] (duration: 06m 42s)
[08:50:03] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[08:50:03] T414533: Update the url of the CTA button for Wikipedia25 portal customisation - https://phabricator.wikimedia.org/T414533
[08:50:52] let's burst the cache
[08:52:28] !log purged portals URLs using: `cat /srv/mediawiki-staging/portals/urls-to-purge.txt | MEDIAWIKI_STAGING_DIR=/srv/mediawiki-staging mwscript purgeList.php` # T414533
[08:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:34] !log https://www.wikipedia.org/ and click that orange button! # T414533
[08:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:43] artemkloko: change is live!
[08:52:52] Dreamy_Jazz: phuedx: it is all yours
[08:53:01] https://www.wikipedia.org/ has been updated
[08:53:44] hashar What about my patch? :)
[08:53:59] Superpes: yes it should be live now
[08:54:55] Wonderful! I asked because I didn't check SAL
[08:55:00] Thanks for your assistance :3
[08:55:49] Superpes: thank you for the logo fix!
[08:56:45] (PS1) Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from serviceops roles [puppet] - https://gerrit.wikimedia.org/r/1227261 (https://phabricator.wikimedia.org/T365798)
[08:57:44] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11524113 (Dzahn) @A_smart_kitten Yea, that is also what we saw over here. Thanks!:)
[08:57:46] Hrrm. I think I can see a bug with the TestKitchen config. 
I'm going to hold off on the deployment until others in my team are online [08:57:54] hashar: I think you can close the window now [08:58:54] Thanks for the ping hashar, mine should have been done by that one scap I did [08:59:05] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11524114 (10Dzahn) 05In progress→03Resolved We are live - QA happening now. [09:04:01] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:04:46] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660 (10kimpham) 03NEW [09:05:12] hashar: I have a patch to backport to wmf.11, but I could do it later as well [09:06:13] kostajh: looks like phuedx and Dreamy_Jazz have finished so feel free to deploy [09:06:23] I am off, I have an appointment [09:06:29] Hello everyone, is there someone knowledgeable about how to deploy the portals? [09:06:45] We just deployed a version, but it seems to need a fix [09:06:51] (03CR) 10Elukey: [C:03+2] profile::docker_registry: tune the s3 config for /restricted [puppet] - 10https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [09:07:17] artemkloko: hashar just had to go [09:07:32] thanks [09:07:45] will start deployment soon [09:08:04] kostajh: would you be able to deploy portal changes like hashar just did? [09:08:10] to help out artemkloko [09:08:23] RECOVERY - Host an-conf1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [09:08:49] i have a doc that could help kostajh [09:09:03] artemkloko: sure, I can look at it [09:09:13] can you share the document with me please? 
[09:09:13] I think there is an issue in the build step that generates the assets for wikimedia/portals/deploy [09:09:33] there is a Gulp project in wikimedia/portals which is built/invoked by a CI job which builds the assets [09:09:48] and some .webm files are not added to the assets dir [09:10:12] they are thus not added when doing a `git commit -A` [09:10:44] it looks like an issue with the `npm run build-all-portals` script from wikimedia/portals [09:10:58] thus I imagine that potentially needs Jan to look into it [09:11:56] and the job building the assets is https://integration.wikimedia.org/ci/job/wikimedia-portals-build/ (which results in publishing a change for the deploy repo at https://gerrit.wikimedia.org/r/q/project:wikimedia/portals/deploy ) [09:12:01] so it is not trivial :\ [09:12:02] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from cloud roles [puppet] - 10https://gerrit.wikimedia.org/r/1227264 (https://phabricator.wikimedia.org/T365798) [09:12:08] I am off for that appointment, I'll be back at 13:30 [09:13:15] (03PS1) 10Kosta Harlan: WebRequest::getSecurityLogContext: Log if user is a bot [core] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227265 (https://phabricator.wikimedia.org/T395204) [09:13:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227265 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan) [09:13:47] artemkloko: which patch are you trying to deploy? 
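Annotation on the build issue described just above (some .webm files never reach the assets dir, so they are missing from the deploy commit): one way to spot this kind of gap might be to compare the source tree against the build output before committing. The following is only a rough sketch, not the portals build itself. The function name `missing_assets` and the directory arguments are hypothetical, and it assumes the build keeps each file's relative path unchanged, which may not hold for the real Gulp pipeline.

```python
from pathlib import Path


def missing_assets(source_dir, build_dir, suffix=".webm"):
    """List files with `suffix` that exist under source_dir but not under build_dir.

    Assumption (not confirmed by the log): the build copies each file to the
    same relative path inside build_dir.
    """
    source = Path(source_dir)
    build = Path(build_dir)
    missing = []
    # Walk the source tree recursively and keep only regular files with the suffix.
    for path in sorted(source.rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        relative = path.relative_to(source)
        # A file counts as missing when nothing exists at the same relative path.
        if not (build / relative).is_file():
            missing.append(relative.as_posix())
    return missing
```

Called with the checked-out source directory and the generated assets directory (both paths depend on the local checkout), a non-empty result would suggest the build dropped files before anything gets staged in git.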
[09:14:01] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:16:55] I am still looking into the bug, have to look into what hashar mentioned [09:18:47] (03PS2) 10Dzahn: microsites: monitor wikipedia25.org (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1224575 [09:19:13] (03PS3) 10Dzahn: microsites: monitor wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1224575 [09:19:25] (03CR) 10Dzahn: microsites: monitor wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1224575 (owner: 10Dzahn) [09:22:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619#11524174 (10JMeybohm) a:03MoritzMuehlenhoff @MoritzMuehlenhoff assigning to you so the next clinic duty person knows you're working on this with Dan, thanks [09:22:33] (03PS4) 10Dzahn: microsites: monitor wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1224575 [09:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:24:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227004 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [09:24:57] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from search roles [puppet] - 10https://gerrit.wikimedia.org/r/1227270 (https://phabricator.wikimedia.org/T365798) [09:25:00] (03PS13) 10Daniel Kinzler: rest gateway: add tests for chart rendering 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 [09:25:06] (03CR) 10Daniel Kinzler: rest gateway: add tests for chart rendering (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 (owner: 10Daniel Kinzler) [09:26:48] (03PS2) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1217133 (https://phabricator.wikimedia.org/T338470) [09:26:49] (03PS5) 10Dzahn: microsites: monitor wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1224575 [09:27:13] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1217133 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [09:27:17] (03Merged) 10jenkins-bot: WebRequest::getSecurityLogContext: Log if user is a bot [core] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227265 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan) [09:27:47] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1227265|WebRequest::getSecurityLogContext: Log if user is a bot (T395204)]] [09:27:52] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [09:28:52] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11524188 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi All hosts that are not pending decom have been migrated to single uplink, resolving. [09:29:53] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1227265|WebRequest::getSecurityLogContext: Log if user is a bot (T395204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:30:12] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11524193 (10JMeybohm) >>! In T414375#11523067, @HMonroy wrote: > @JMeybohm Hi! I'm trying a query wmf.mediawiki_history in superset. I'm... [09:32:47] !log kharlan@deploy2002 kharlan: Continuing with sync [09:33:13] (03PS4) 10Daniel Kinzler: rest gateway: implement per-policy shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225699 (https://phabricator.wikimedia.org/T413183) [09:36:51] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227265|WebRequest::getSecurityLogContext: Log if user is a bot (T395204)]] (duration: 09m 04s) [09:36:55] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [09:36:59] (03CR) 10Dzahn: [C:03+2] microsites: monitor wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1224575 (owner: 10Dzahn) [09:37:56] (03PS5) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) [09:38:21] (03CR) 10JMeybohm: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1227261 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:38:43] (03CR) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [09:39:32] (03PS2) 10Daniel Kinzler: rest gateway: include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) [09:39:41] (03CR) 10Majavah: [C:03+1] Remove profile::puppet::agent::force_puppet7 
from cloud roles [puppet] - 10https://gerrit.wikimedia.org/r/1227264 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:42:58] (03PS14) 10Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) [09:44:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1226774 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [09:45:55] (03CR) 10Filippo Giunchedi: [C:03+1] Remove profile::puppet::agent::force_puppet7 from cloud roles [puppet] - 10https://gerrit.wikimedia.org/r/1227264 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:46:38] (03CR) 10Muehlenhoff: "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1226775 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [09:47:43] (03CR) 10Elukey: [C:03+2] admin: add the analytics-sre uid and gid [puppet] - 10https://gerrit.wikimedia.org/r/1226774 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [09:56:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1200.eqiad.wmnet [09:57:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11524257 (10ops-monitoring-bot) Host an-worker1200.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to allow unmounting failed disk [09:58:32] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on tcp-proxy1001.eqiad.wmnet with reason: remove nftables [10:04:01] (03PS1) 10D3r1ck01: Control: Handle accepted consumers with "auth-only" grants [extensions/OAuth] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1227280 (https://phabricator.wikimedia.org/T413947) [10:04:36] (03PS1) 10D3r1ck01: Control: When saving grants, ensure array has no gaps [extensions/OAuth] (wmf/1.46.0-wmf.11) - 
10https://gerrit.wikimedia.org/r/1227281 [10:05:01] (03PS1) 10D3r1ck01: Control: Keep irrevocable grants when accepting new OAuth 2 consumers [extensions/OAuth] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227282 (https://phabricator.wikimedia.org/T413947) [10:05:28] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:05:29] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [10:06:01] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:06:02] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [10:07:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy1001.eqiad.wmnet [10:07:20] (03Abandoned) 10D3r1ck01: Control: Handle accepted consumers with "auth-only" grants [extensions/OAuth] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1227280 (https://phabricator.wikimedia.org/T413947) (owner: 10D3r1ck01) [10:08:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/OAuth] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227281 (owner: 10D3r1ck01) [10:08:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/OAuth] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227282 (https://phabricator.wikimedia.org/T413947) (owner: 10D3r1ck01) [10:09:21] (03CR) 10Vgutierrez: [C:03+2] cache::upload: rate-limit rather than blocking bingbot [puppet] - 10https://gerrit.wikimedia.org/r/1227202 (owner: 10Giuseppe Lavagetto) [10:10:39] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11524278 (10cmooney) >>! 
In T408892#11523618, @Papaul wrote: > Phase 1 of ULSFO migration which was changing the loopback addresses of cr1,cr4 ,mr1 and the IP... [10:11:07] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy1001.eqiad.wmnet [10:12:05] (03PS2) 10Elukey: role::puppetserver: deploy kerberos keytab for analytics-sre [puppet] - 10https://gerrit.wikimedia.org/r/1226775 (https://phabricator.wikimedia.org/T402512) [10:13:24] (03CR) 10Elukey: [C:03+2] role::puppetserver: deploy kerberos keytab for analytics-sre [puppet] - 10https://gerrit.wikimedia.org/r/1226775 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:14:54] (03PS2) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) [10:16:57] PROBLEM - Host an-worker1200 is DOWN: PING CRITICAL - Packet loss = 100% [10:19:41] 06SRE, 07Kubernetes, 10ServiceOps new: Failing docker registry tests - https://phabricator.wikimedia.org/T414576#11524310 (10JMeybohm) p:05Triage→03Medium The 403 vs. 401 or 404 are the result of the tests being run against a read-only (`profile::docker_registry::read_only_mode`) instance of the registry... 
[10:19:53] 06SRE, 07Kubernetes, 10ServiceOps new: Failing docker registry httpbb tests - https://phabricator.wikimedia.org/T414576#11524313 (10JMeybohm) [10:20:24] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from cloud roles [puppet] - 10https://gerrit.wikimedia.org/r/1227264 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:22:32] (03CR) 10Dzahn: [C:03+2] "had to follow-up and remove the nftables package via cumin and reboot the hosts - normally we don't have this case where we move from nfta" [puppet] - 10https://gerrit.wikimedia.org/r/1215284 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [10:23:03] (03PS3) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) [10:23:03] (03PS1) 10Elukey: role::puppetserver: add the profile to fetch the krb keytabs [puppet] - 10https://gerrit.wikimedia.org/r/1227285 (https://phabricator.wikimedia.org/T402512) [10:23:49] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:26:05] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from serviceops roles [puppet] - 10https://gerrit.wikimedia.org/r/1227261 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:27:36] (03PS4) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) [10:27:47] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:28:24] (03CR) 10Elukey: [C:03+2] role::puppetserver: add the profile to fetch the krb keytabs [puppet] - 10https://gerrit.wikimedia.org/r/1227285 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:30:45] !log 
marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1190.eqiad.wmnet with reason: Maintenance [10:30:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T413525)', diff saved to https://phabricator.wikimedia.org/P87541 and previous config saved to /var/cache/conftool/dbconfig/20260115-103053-marostegui.json [10:30:57] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:34:40] (03PS1) 10Elukey: Add fake kerberos keytabs for the Puppetserver hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1227290 (https://phabricator.wikimedia.org/T402512) [10:35:01] (03CR) 10Elukey: [V:03+2 C:03+2] Add fake kerberos keytabs for the Puppetserver hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1227290 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:35:47] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:35:59] (03PS5) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) [10:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:38:25] FIRING: [14x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:11] (03PS6) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs 
rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) [10:39:55] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:41:49] (03PS7) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) [10:42:14] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:42:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11524338 (10BTullis) The RAID controller firmware is already the latest version. {F71530261} {F71530265} I'm continuing to... [10:51:00] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - haproxy 2.8.18 upgrade (T414318) [10:51:04] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [10:51:16] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - haproxy 2.8.18 upgrade (T414318) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1100) [11:03:25] FIRING: [15x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:45] (03Abandoned) 10Giuseppe Lavagetto: Revert "Move status, commit status/history to database" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1226867 (owner: 10Giuseppe 
Lavagetto) [11:10:00] !log force dbprov1004 restart [11:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:49] (03PS15) 10Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) [11:11:58] (03CR) 10CI reject: [V:04-1] charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [11:12:02] (03CR) 10Daniel Kinzler: charts: add redioscope chart and service (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [11:13:24] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from IF roles [puppet] - 10https://gerrit.wikimedia.org/r/1227292 (https://phabricator.wikimedia.org/T365798) [11:15:27] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11524436 (10ATitkov) QA was successful. Some people report needed a refresh for the first visit on https://wikipedia25.org/ or https://w... 
[11:16:56] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1200.eqiad.wmnet [11:21:13] !log installing nginx security updates [11:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:19] (03CR) 10Elukey: [C:03+1] Remove profile::puppet::agent::force_puppet7 from IF roles [puppet] - 10https://gerrit.wikimedia.org/r/1227292 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:26:18] (03PS1) 10Vgutierrez: tcpproxy: Accept connections from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1227294 [11:26:41] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227294 (owner: 10Vgutierrez) [11:26:48] (03CR) 10CI reject: [V:04-1] tcpproxy: Accept connections from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1227294 (owner: 10Vgutierrez) [11:26:54] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from IF roles [puppet] - 10https://gerrit.wikimedia.org/r/1227292 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:29:26] (03PS2) 10Vgutierrez: tcpproxy: Accept connections from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1227294 [11:29:39] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - haproxy 2.8.18 upgrade (T414318) [11:29:42] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [11:29:51] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11524474 (10ATitkov) I know it might look too soon, but I want to request either scheduled re-deployments or the ability to deploy myself... 
[11:30:19] 10ops-eqiad, 06DC-Ops: dbprov1004 lost connectivity, leading to a pause in eqiad database backups - https://phabricator.wikimedia.org/T414668 (10jcrespo) 03NEW [11:31:45] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227294 (owner: 10Vgutierrez) [11:33:26] (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual model using bookworm base image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227296 (https://phabricator.wikimedia.org/T411786) [11:33:51] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - haproxy 2.8.18 upgrade (T414318) [11:35:52] (03PS2) 10Gkyziridis: ml-services: Deploy rr-multilingual model using bookworm base image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227296 (https://phabricator.wikimedia.org/T411786) [11:37:00] 06SRE, 10Observability-Metrics: Change units for "network utilization" on "host overview" dashboard to bits/sec - https://phabricator.wikimedia.org/T414670 (10cmooney) 03NEW p:05Triage→03Low [11:37:21] 06SRE, 10Observability-Metrics: Change units for "network utilization" on "host overview" dashboard to bits/sec - https://phabricator.wikimedia.org/T414670#11524521 (10cmooney) [11:37:52] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11524522 (10WMDE-leszek) I approve this request on WMDE's end. Thank you [11:39:18] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy rr-multilingual model using bookworm base image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1227296 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis) [11:40:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T413525)', diff saved to https://phabricator.wikimedia.org/P87542 and previous config saved to /var/cache/conftool/dbconfig/20260115-114015-marostegui.json [11:40:19] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [11:46:17] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual model using bookworm base image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227296 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis) [11:48:06] (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual model using bookworm base image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227296 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis) [11:50:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P87543 and previous config saved to /var/cache/conftool/dbconfig/20260115-115023-marostegui.json [11:51:03] (03PS2) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from traffic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) [11:51:40] 10ops-eqiad, 06DC-Ops: dbprov1004 lost connectivity, leading to a pause in eqiad database backups - https://phabricator.wikimedia.org/T414668#11524548 (10jcrespo) For context, rebooting the host didn't fix the issue. [11:52:11] !log gkyziridis@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [11:52:28] !log gkyziridis@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[11:52:34] (03CR) 10Muehlenhoff: "Thanks, these were already removed (hcaptcha via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1227261 and the insetup role via htt" [puppet] - 10https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:00:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P87544 and previous config saved to /var/cache/conftool/dbconfig/20260115-120032-marostegui.json [12:00:51] 06SRE, 10SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671 (10kimpham) 03NEW [12:02:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11524578 (10cmooney) //dse-k8s-worker1013// seems fairly happy in terms of the original problem since we made the change y... 
[12:10:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T413525)', diff saved to https://phabricator.wikimedia.org/P87545 and previous config saved to /var/cache/conftool/dbconfig/20260115-121040-marostegui.json [12:10:44] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:10:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance [12:11:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T413525)', diff saved to https://phabricator.wikimedia.org/P87546 and previous config saved to /var/cache/conftool/dbconfig/20260115-121105-marostegui.json [12:16:36] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11524635 (10BTullis) >>! In T414460#11521367, @CDanis wrote: >>>! In T414460#11521085, @cmooney wrote: >> The k8s host sen... 
[12:22:08] (03PS1) 10Muehlenhoff: conf/etcd: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1227307 (https://phabricator.wikimedia.org/T352245) [12:23:17] (03PS1) 10Muehlenhoff: conf/etcd: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1227309 (https://phabricator.wikimedia.org/T352245) [12:23:43] (03CR) 10Muehlenhoff: [C:03+2] wikidough: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224708 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [12:24:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11524649 (10MoritzMuehlenhoff) [12:26:14] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227310 [12:27:50] 06SRE, 06serviceops, 07Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762#11524650 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Since there is clearly no need for optimization here, I'll resolve this now. [12:28:34] (03PS1) 10JMeybohm: httpbb: Remove assertions for X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/1227311 (https://phabricator.wikimedia.org/T414576) [12:28:43] jouncebot: nowandnext [12:28:43] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [12:28:43] In 0 hour(s) and 31 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1300) [12:29:48] can I deploy a config patch? (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1226232) [12:31:52] ihurbain: no objection from me. 
That sampling rate definition is confusing af [12:33:57] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from Data Platform roles [puppet] - 10https://gerrit.wikimedia.org/r/1227313 (https://phabricator.wikimedia.org/T365798) [12:34:48] claime: the fact that i got confused by it is probably a good sign (but it's also how we apparently sample, and i get that, integers are good, etc) [12:34:56] anyway spiderpigging. [12:35:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [12:36:30] (03Merged) 10jenkins-bot: Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [12:37:05] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1226232|Turn on debugging for unsafe postproc cache entries logging (T412803)]] [12:37:09] T412803: Tweak unsafe post-processing cache keys - https://phabricator.wikimedia.org/T412803 [12:39:14] !log ihurbain@deploy2002 ihurbain: Backport for [[gerrit:1226232|Turn on debugging for unsafe postproc cache entries logging (T412803)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:39:35] RECOVERY - Host an-worker1200 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [12:41:09] (03PS2) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from Data Platform roles [puppet] - 10https://gerrit.wikimedia.org/r/1227313 (https://phabricator.wikimedia.org/T365798) [12:41:24] !log ihurbain@deploy2002 ihurbain: Continuing with sync [12:45:27] RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [12:45:29] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226232|Turn on debugging for unsafe postproc cache entries logging (T412803)]] (duration: 08m 24s) [12:45:33] T412803: Tweak unsafe post-processing cache keys - https://phabricator.wikimedia.org/T412803 [12:45:38] woot. [12:46:27] and yay, i'm seeing my new logs! [12:48:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227307 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [12:49:07] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1200 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [12:50:37] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11524721 (10Mvolz) So we're running at around 10% error for mediawikijs requests, we're allowe... 
[12:51:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227309 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [12:53:42] !log drainin Arelion transit circuit on cr1-codfw in advance of adding second 10G port to bundle [12:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:00] (03PS4) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 [12:55:28] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11524727 (10Mvolz) If you look for https://thanos.wikimedia.org/graph?g0.expr=sum(rate(citoid_... [12:57:11] (03CR) 10Elukey: [C:03+1] httpbb: Remove assertions for X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/1227311 (https://phabricator.wikimedia.org/T414576) (owner: 10JMeybohm) [12:59:08] jouncebot: next [12:59:09] In 0 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1300) [12:59:17] jouncebot: nowandnext [12:59:17] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [12:59:17] In 0 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1300) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1300) [13:00:10] You win this time jouncebot [13:01:36] (03PS1) 10Majavah: P:toolforge: k8s: haproxy: Handle plain toolforge.org domain [puppet] - 10https://gerrit.wikimedia.org/r/1227321 (https://phabricator.wikimedia.org/T414674) [13:01:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff) [13:03:25] FIRING: [15x] SystemdUnitFailed: 
send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:01] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:10] (03CR) 10JMeybohm: [C:03+2] httpbb: Remove assertions for X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/1227311 (https://phabricator.wikimedia.org/T414576) (owner: 10JMeybohm) [13:09:37] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1227322 (https://phabricator.wikimedia.org/T365798) [13:15:01] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:22:57] (03CR) 10Elukey: "Left a nit but we are close!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [13:23:58] 06SRE, 07Kubernetes, 13Patch-For-Review, 10ServiceOps new: Failing docker registry httpbb tests - https://phabricator.wikimedia.org/T414576#11524771 (10JMeybohm) a:03DPogorzelski-WMF The X-Cache-Status failures are gone now: ` jayme@cumin1003:~$ sudo httpbb /srv/deployment/httpbb-tests/docker-registry/te... 
[13:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:25:51] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mwlog1003.eqiad.wmnet with OS bookworm [13:26:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11524779 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with erro... [13:26:17] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227330 (https://phabricator.wikimedia.org/T128546) [13:26:24] (03PS1) 10Muehlenhoff: Record LDAP access for aramilferaxa [puppet] - 10https://gerrit.wikimedia.org/r/1227331 [13:27:05] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog1003.eqiad.wmnet with OS bookworm [13:27:08] (03CR) 10CI reject: [V:04-1] Record LDAP access for aramilferaxa [puppet] - 10https://gerrit.wikimedia.org/r/1227331 (owner: 10Muehlenhoff) [13:27:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11524785 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm [13:27:33] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge: k8s: haproxy: Handle plain toolforge.org domain [puppet] - 10https://gerrit.wikimedia.org/r/1227321 (https://phabricator.wikimedia.org/T414674) (owner: 10Majavah) [13:27:48] !log installing squid security updates [13:27:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:57] (03CR) 10Filippo Giunchedi: [C:03+1] Remove profile::puppet::agent::force_puppet7 for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1227322 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:28:34] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7899/co" [puppet] - 10https://gerrit.wikimedia.org/r/1227321 (https://phabricator.wikimedia.org/T414674) (owner: 10Majavah) [13:29:06] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: k8s: haproxy: Handle plain toolforge.org domain [puppet] - 10https://gerrit.wikimedia.org/r/1227321 (https://phabricator.wikimedia.org/T414674) (owner: 10Majavah) [13:29:31] jan_drewniak: we can do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1227330 I think [13:29:35] jouncebot: nowandnext [13:29:35] For the next 0 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1300) [13:29:35] In 0 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1400) [13:30:01] hey folks, I'm going to be deploying a portals updates now just ahead of the backport window [13:30:02] (03CR) 10Hashar: [C:03+1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227330 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:31:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227330 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:32:40] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227330 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:33:12] !log 
jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1227330|Bumping portals to master (T128546)]] [13:33:17] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [13:34:24] (03Abandoned) 10Muehlenhoff: Record LDAP access for aramilferaxa [puppet] - 10https://gerrit.wikimedia.org/r/1227331 (owner: 10Muehlenhoff) [13:35:18] (03PS1) 10Filippo Giunchedi: wmcs: remove value from CephSlowOps summary [alerts] - 10https://gerrit.wikimedia.org/r/1227334 (https://phabricator.wikimedia.org/T414669) [13:35:26] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:1227330|Bumping portals to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:36:27] 06SRE, 10SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11524827 (10Novem_Linguae) Are you requesting `deployment` access? > backlog deployment windows Do you mean [[ https://wikitech.wikimedia.org/wiki/Backport_windo... [13:37:07] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [13:37:51] (03CR) 10Majavah: [C:04-1] "The number is useful to see in some form, so can it be added to the description if it can't be in the summary?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1227334 (https://phabricator.wikimedia.org/T414669) (owner: 10Filippo Giunchedi) [13:40:23] (03PS2) 10Filippo Giunchedi: wmcs: remove value from CephSlowOps summary [alerts] - 10https://gerrit.wikimedia.org/r/1227334 (https://phabricator.wikimedia.org/T414669) [13:40:35] (03CR) 10Filippo Giunchedi: "Fair point, {{done}}" [alerts] - 10https://gerrit.wikimedia.org/r/1227334 (https://phabricator.wikimedia.org/T414669) (owner: 10Filippo Giunchedi) [13:41:10] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227330|Bumping portals to master (T128546)]] (duration: 07m 58s) [13:41:14] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [13:42:02] !log upgrade wikidough to Bird 2.18 T413740 [13:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:06] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [13:42:46] (03CR) 10Majavah: [C:03+1] wmcs: remove value from CephSlowOps summary [alerts] - 10https://gerrit.wikimedia.org/r/1227334 (https://phabricator.wikimedia.org/T414669) (owner: 10Filippo Giunchedi) [13:43:11] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1227322 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:43:59] hashar: I just ran the sync through spiderpig. 
Now I logged into deploy2002 and run `MEDIAWIKI_STAGING_DIR=/srv/mediawiki-staging | mwscript purgeList.php` [13:44:43] (03PS1) 10Filippo Giunchedi: sre: remove value from MaxConntrack summary [alerts] - 10https://gerrit.wikimedia.org/r/1227335 (https://phabricator.wikimedia.org/T414669) [13:45:38] (03CR) 10Filippo Giunchedi: [C:03+2] wmcs: remove value from CephSlowOps summary [alerts] - 10https://gerrit.wikimedia.org/r/1227334 (https://phabricator.wikimedia.org/T414669) (owner: 10Filippo Giunchedi) [13:47:23] hashar: ok, deployed and purged successfully! [13:47:33] well done! [13:48:04] I have sent some changes to the docs on https://gerrit.wikimedia.org/r/q/project:wikimedia/portals+is:open+owner:hashar [13:48:11] then I don't know whether they are accurate [13:57:30] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11524895 (10MoritzMuehlenhoff) [13:58:19] 10ops-eqiad, 06SRE, 06DC-Ops: dbprov1004 lost connectivity, leading to a pause in eqiad database backups - https://phabricator.wikimedia.org/T414668#11524898 (10Jclark-ctr) a:03Jclark-ctr [13:58:45] (03PS1) 10Elukey: role::puppetserver: remove kerberos config [puppet] - 10https://gerrit.wikimedia.org/r/1227338 (https://phabricator.wikimedia.org/T402512) [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1400) [14:00:05] Seawolf35, JSherman, stephanebisson, and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] o/ [14:00:12] o/ [14:00:14] o/ [14:00:18] o/ [14:00:19] o/ [14:00:26] I can deploy! 
[14:00:43] let’s start with Seawolf35 ^^ [14:00:52] Ok [14:01:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [14:01:18] (03Abandoned) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:01:51] (03PS1) 10Cathal Mooney: Remove offload of Comcast traffic from Arelion [homer/public] - 10https://gerrit.wikimedia.org/r/1227341 (https://phabricator.wikimedia.org/T261867) [14:02:17] (03Merged) 10jenkins-bot: ukwiki: Various changes to user rights. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [14:02:49] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1225596|ukwiki: Various changes to user rights. (T414277)]] [14:02:53] T414277: Some changes in user group rights in ukwiki - https://phabricator.wikimedia.org/T414277 [14:05:00] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, seawolf35gerrit: Backport for [[gerrit:1225596|ukwiki: Various changes to user rights. (T414277)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:05:28] Seawolf35: please test! 
[14:05:49] I’m using the debug cookie on my phone fyi [14:06:18] hmm, I still see the movestable right in the autoconfirmed group I think [14:06:19] (03PS2) 10Cathal Mooney: Remove offload of Comcast traffic from Arelion [homer/public] - 10https://gerrit.wikimedia.org/r/1227341 (https://phabricator.wikimedia.org/T261867) [14:06:25] RECOVERY - Host dbprov1004 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [14:06:51] 10ops-eqiad, 06SRE, 06DC-Ops: dbprov1004 lost connectivity, leading to a pause in eqiad database backups - https://phabricator.wikimedia.org/T414668#11524921 (10Jclark-ctr) @jcrespo Replaced Dac cable link came up. [14:07:04] same for the confirmed group [14:07:26] Everything else seemed to work. [14:08:01] (03CR) 10Ayounsi: [C:03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/1227341 (https://phabricator.wikimedia.org/T261867) (owner: 10Cathal Mooney) [14:08:37] 10ops-eqiad, 06SRE, 06DC-Ops: dbprov1004 lost connectivity, leading to a pause in eqiad database backups - https://phabricator.wikimedia.org/T414668#11524927 (10Jclark-ctr) 05Open→03Resolved updated netbox cableid [14:09:01] looks like the same is also true for ruwikinews, despite its 'autoconfirmed' => [ 'movestable' => false, ] [14:09:04] (03CR) 10Elukey: [C:03+2] role::puppetserver: remove kerberos config [puppet] - 10https://gerrit.wikimedia.org/r/1227338 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:09:22] * Lucas_WMDE searches phabricator [14:09:24] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226218 (owner: 10Muehlenhoff) [14:09:51] (03PS1) 10Elukey: Revert "Add fake kerberos keytabs for the Puppetserver hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1227342 [14:09:56] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "Add fake kerberos keytabs for the Puppetserver hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1227342 (owner: 10Elukey) [14:10:26] (03CR) 10Cathal Mooney: [C:03+2] 
Remove offload of Comcast traffic from Arelion [homer/public] - 10https://gerrit.wikimedia.org/r/1227341 (https://phabricator.wikimedia.org/T261867) (owner: 10Cathal Mooney) [14:11:05] Seawolf35: I think let’s deploy the config change anyway, but the task should then stay open for further investigation what’s going on with this right [14:11:07] does that sound okay? [14:11:35] Sounds good. [14:11:47] (03Merged) 10jenkins-bot: Remove offload of Comcast traffic from Arelion [homer/public] - 10https://gerrit.wikimedia.org/r/1227341 (https://phabricator.wikimedia.org/T261867) (owner: 10Cathal Mooney) [14:11:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, seawolf35gerrit: Continuing with sync [14:11:56] Everything else like change tags looks good on my end [14:11:57] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [14:12:05] alright, thanks [14:13:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678 (10Johannes_Richter_WMDE) 03NEW [14:13:57] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:15:28] JSherman: want to self-service once the current deployment is done? [14:16:02] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225596|ukwiki: Various changes to user rights. (T414277)]] (duration: 13m 13s) [14:16:06] T414277: Some changes in user group rights in ukwiki - https://phabricator.wikimedia.org/T414277 [14:16:06] Lucas_WMDE: on it [14:16:10] jclark@cumin1003 reimage (PID 1651082) is awaiting input [14:16:10] sounds good [14:16:15] ok! 
[14:16:19] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [14:17:20] (my spiderpig finished, you’re good to go) [14:17:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11524963 (10Johannes_Richter_WMDE) By the way I noticed {T358578} – is that still common practice @Dzahn? (I'm not in the #wmf-nda group despite signing the NDA in... [14:17:31] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:18:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226862 (https://phabricator.wikimedia.org/T403982) (owner: 10Jsn.sherman) [14:18:27] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [14:18:37] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11524970 (10Tobi_WMDE_SW) @Johannes_Richter_WMDE is part of the WMDE TechWish team, and I endorse this request. [14:18:43] (03Merged) 10jenkins-bot: Deploy PersonalDashboard to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226862 (https://phabricator.wikimedia.org/T403982) (owner: 10Jsn.sherman) [14:19:04] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:19:08] stephanebisson: do you want to do your deploy afterwards? you could probably start the gate-and-submit builds already [14:19:14] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1226862|Deploy PersonalDashboard to testwiki (T403982)]] [14:19:18] T403982: Create and deploy Extension:PersonalDashboard - https://phabricator.wikimedia.org/T403982 [14:19:44] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:20:08] Lucas_WMDE: yes I'll do them, getting started soon [14:20:37] ok! 
[14:21:24] !log jsn@deploy2002 jsn: Backport for [[gerrit:1226862|Deploy PersonalDashboard to testwiki (T403982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:22:06] (03CR) 10CDanis: [C:03+2] tcpproxy: Accept connections from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1227294 (owner: 10Vgutierrez) [14:22:22] that was a highly motivated review lol [14:22:38] !log jsn@deploy2002 jsn: Continuing with sync [14:22:43] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update reverse dns entries for arelion link ips - cmooney@cumin1003" [14:23:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update reverse dns entries for arelion link ips - cmooney@cumin1003" [14:23:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:25:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [alerts] - 10https://gerrit.wikimedia.org/r/1227335 (https://phabricator.wikimedia.org/T414669) (owner: 10Filippo Giunchedi) [14:25:20] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1159.eqiad.wmnet [14:25:31] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1159.eqiad.wmnet [14:26:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11525006 (10CDanis) >>! In T414460#11524635, @BTullis wrote: > My assumption is that this is more likely related to the ce... [14:27:46] Lucas_WMDE can I just +2 the patches manually and start the real deployment later? 
[14:28:12] stephanebisson: yes [14:28:19] (03CR) 10Sbisson: [C:03+2] CX3 Build 1.0.0+20260114 [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226976 (https://phabricator.wikimedia.org/T413646) (owner: 10Sbisson) [14:28:23] as long as nobody else is planning to deploy, because then they would pull in your changes ww [14:28:24] * ^^ [14:28:33] (03PS8) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [14:28:42] I think I'm next in line [14:28:45] yeah [14:28:50] (03CR) 10Sbisson: [C:03+2] Fallback to source title if target title is not provided by cxserver [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226977 (https://phabricator.wikimedia.org/T414558) (owner: 10Sbisson) [14:28:50] we're about 3/4 through syncing prod k8s on mine, so I think you're good to +2 [14:28:56] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226862|Deploy PersonalDashboard to testwiki (T403982)]] (duration: 09m 41s) [14:29:00] T403982: Create and deploy Extension:PersonalDashboard - https://phabricator.wikimedia.org/T403982 [14:29:01] stephanebisson: over to you [14:29:03] finished! [14:29:07] Thanks! 
[14:29:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226976 (https://phabricator.wikimedia.org/T413646) (owner: 10Sbisson) [14:29:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226977 (https://phabricator.wikimedia.org/T414558) (owner: 10Sbisson) [14:30:18] depending on how long that gate-and-submit will take we could’ve tried to squeeze in phuedx in between [14:30:22] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:30:27] but I don’t think it’s necessary, there should be enough time afterwards [14:32:06] Lucas_WMDE: fwiw my gut instinct is that the movestable permissions thing might be something to do with FlaggedRevs [14:33:11] ah, our favorite codebase? 
[14:33:20] just the one :D [14:33:34] when in doubt, blame FlaggedRevs [14:33:49] Beyond my pay grade [14:33:52] (03PS9) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [14:33:54] maybe some subtasks of T225144 are similar [14:33:54] T225144: Flagged Revs configuration may be broken - https://phabricator.wikimedia.org/T225144 [14:34:01] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:34:02] (I found some other Phabricator tasks that sounded related, though not quite the same) [14:34:21] T275370 [14:34:22] T275370: Unable to move pages despite being autoconfirmed on wikis with FlaggedRevs - https://phabricator.wikimedia.org/T275370 [14:35:15] my gut instinct (untested) would be to move the FlaggedRevs user group-related config that isn't currently working out of core-Permissions.php & add it to the MediaWikiServices hook in flaggedrevs.php [14:37:41] 06SRE, 10SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11525053 (10WMDE-leszek) I approve this request on WMDE's end, and take the responsibility for the backlog instead of backport brainfart. @kimpham should not have... 
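[editor's note] The movestable discussion above (the right still showing for autoconfirmed on ruwikinews despite 'movestable' => false, and the [14:35:15] suggestion to move the FlaggedRevs group config into the hook in flaggedrevs.php) can be read as an ordering problem: whichever code assigns the key last wins. Below is a minimal Python sketch of that idea, not the actual mediawiki-config code. It assumes the static per-wiki settings are applied first and a later hook assigns the same key unconditionally; the function names, the late_overrides parameter, and the default values are all hypothetical.

```python
from copy import deepcopy


def apply_static_config(group_permissions, overrides):
    """Hypothetical stand-in for static per-wiki settings applied early."""
    merged = deepcopy(group_permissions)
    for group, rights in overrides.items():
        merged.setdefault(group, {}).update(rights)
    return merged


def flaggedrevs_hook(group_permissions, late_overrides=None):
    """Hypothetical stand-in for a hook that runs after the static settings.

    It assigns its own default unconditionally, then applies any overrides
    that were moved into the hook itself.
    """
    result = deepcopy(group_permissions)
    result.setdefault("autoconfirmed", {})["movestable"] = True
    for group, rights in (late_overrides or {}).items():
        result.setdefault(group, {}).update(rights)
    return result


defaults = {"autoconfirmed": {"movestable": True}}
static_overrides = {"autoconfirmed": {"movestable": False}}

# Override applied early: the static setting takes effect at first...
after_static = apply_static_config(defaults, static_overrides)
# ...but the later hook assigns the same key again, so the override is lost.
final = flaggedrevs_hook(after_static)

# Override moved into the hook: it is applied last, so it survives.
moved = flaggedrevs_hook(defaults, late_overrides=static_overrides)

print(after_static["autoconfirmed"]["movestable"])  # False
print(final["autoconfirmed"]["movestable"])         # True
print(moved["autoconfirmed"]["movestable"])         # False
```

Under those assumptions, an override set only in the early step is silently undone by the hook, while the same override applied inside the hook takes effect. Whether production config actually behaves this way is for the task investigation to confirm.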
[14:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:38:14] A_smart_kitten: geeeez [14:38:18] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20260114 [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226976 (https://phabricator.wikimedia.org/T413646) (owner: 10Sbisson) [14:38:27] (03PS1) 10Jsn.sherman: Deploy PersonalDashboard to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227346 (https://phabricator.wikimedia.org/T403982) [14:38:30] I hadn’t seen that hook before. that’s… something [14:38:50] (03Merged) 10jenkins-bot: Fallback to source title if target title is not provided by cxserver [extensions/ContentTranslation] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226977 (https://phabricator.wikimedia.org/T414558) (owner: 10Sbisson) [14:39:00] yeah there’s some stuff like $wgGroupPermissions['editor']['autoreview'] = false; there [14:39:03] (03CR) 10Dzahn: [C:03+1] tcpproxy: Accept connections from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1227294 (owner: 10Vgutierrez) [14:39:11] 06SRE, 10SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11525059 (10WMDE-leszek) [14:39:24] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1226976|CX3 Build 1.0.0+20260114 (T413646)]], [[gerrit:1226977|Fallback to source title if target title is not provided by cxserver (T414558)]] [14:39:28] I’ll go make a task [14:39:30] T413646: Content Translation: cannot select an existing target article; section translation is published to a redirect instead of the 
main article (target language: Russian). - https://phabricator.wikimedia.org/T413646 [14:39:31] T414558: Wikipedia Content Translation Tool displays blank page and never loads - https://phabricator.wikimedia.org/T414558 [14:39:39] (03PS10) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [14:39:42] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:40:15] (03CR) 10Vgutierrez: [C:04-1] "hiera files target eqsin, not drmrs" [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:41:32] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1226976|CX3 Build 1.0.0+20260114 (T413646)]], [[gerrit:1226977|Fallback to source title if target title is not provided by cxserver (T414558)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:42:27] (03CR) 10Muehlenhoff: [C:03+2] hcaptcha proxy: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224709 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [14:43:19] !log sbisson@deploy2002 sbisson: Continuing with sync [14:43:35] Lucas_WMDE: just noting that I forgot to add the extension load to common settings to enable personaldashboard on testwiki, making my patch a noop. I just kept it moving and created a new patch to complete the enablement. Will followup in another window. 
[14:43:48] 10ops-eqiad, 06DC-Ops: Power Supply - PS1 Status - issue on clouddb1024:9290 - https://phabricator.wikimedia.org/T414681 (10phaultfinder) 03NEW [14:44:08] !log installing net-snmp security updates [14:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:31] (03CR) 10Filippo Giunchedi: [C:03+2] sre: remove value from MaxConntrack summary [alerts] - 10https://gerrit.wikimedia.org/r/1227335 (https://phabricator.wikimedia.org/T414669) (owner: 10Filippo Giunchedi) [14:45:06] (03PS11) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [14:45:06] (03PS1) 10CDanis: gerrit/Liberica: eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1227348 (https://phabricator.wikimedia.org/T411895) [14:45:22] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227348 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:45:24] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:46:22] JSherman: ack [14:47:00] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11525113 (10Dzahn) I can help with another deployment tomorrow, Friday 16, but not after that until next month. Whether deployment right... [14:47:31] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226976|CX3 Build 1.0.0+20260114 (T413646)]], [[gerrit:1226977|Fallback to source title if target title is not provided by cxserver (T414558)]] (duration: 08m 07s) [14:47:36] T413646: Content Translation: cannot select an existing target article; section translation is published to a redirect instead of the main article (target language: Russian). 
- https://phabricator.wikimedia.org/T413646 [14:47:37] T414558: Wikipedia Content Translation Tool displays blank page and never loads - https://phabricator.wikimedia.org/T414558 [14:48:30] phuedx: over to you, do you also want to self-service? [14:49:09] I can self service [14:49:16] (03CR) 10Vgutierrez: [C:03+1] gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:49:16] ok, go ahead :) [14:50:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227004 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [14:50:56] (03PS12) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [14:50:56] (03PS2) 10CDanis: gerrit/Liberica: eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1227348 (https://phabricator.wikimedia.org/T411895) [14:50:59] (03Merged) 10jenkins-bot: Enable Test Kitchen on all prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227004 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [14:51:00] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:51:06] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227348 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:51:30] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1227004|Enable Test Kitchen on all prod wikis (T407806)]] [14:51:34] T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806 [14:51:47] Lucas_WMDE: aaahhhh the autoconfirmed movestable permission is *overridden* in the flaggedrevs.php MediaWikiServices hook [14:51:48] 
https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/19adfae2241be7a72c651d64dd318dd57f560c59/wmf-config/flaggedrevs.php#207 [14:51:58] !log cdanis@cumin1003 conftool action : set/pooled=yes; selector: cluster=tcp-proxy,service=gerrit [14:53:52] !log phuedx@deploy2002 cjming, phuedx: Backport for [[gerrit:1227004|Enable Test Kitchen on all prod wikis (T407806)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:53:58] A_smart_kitten: created T414684 [14:53:58] T414684: FlaggedRevs-specific group rights from core-Permissions.php get overridden by flaggedrevs.php - https://phabricator.wikimedia.org/T414684 [14:54:30] Looking at the test servers now [14:54:52] ty Lucas_WMDE! [14:56:07] (03CR) 10Scott French: [C:03+1] conf/etcd: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1227307 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [14:56:09] (03CR) 10Scott French: [C:03+1] conf/etcd: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1227309 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [14:56:42] Configuration is coming through OK. There aren't any instruments or experiments using TestKitchen codepaths so I'm not expecting to see anything in the console [14:57:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T413525)', diff saved to https://phabricator.wikimedia.org/P87549 and previous config saved to /var/cache/conftool/dbconfig/20260115-145727-marostegui.json [14:57:31] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:57:50] (03CR) 10Scott French: [C:03+1] "Thanks, Moritz!" 
[puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff) [14:59:08] The SDKs are available as expected [14:59:57] I’m going afk, I hope everything goes fine with the rest of the window [15:00:25] RECOVERY - Host an-worker1159 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:00:47] PROBLEM - SSH on an-worker1159 is CRITICAL: connect to address 10.64.153.4 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:01:06] Continuing with sync [15:01:15] !log phuedx@deploy2002 cjming, phuedx: Continuing with sync [15:02:15] (03CR) 10CDanis: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1215693/5634/" [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:05:16] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227004|Enable Test Kitchen on all prod wikis (T407806)]] (duration: 13m 46s) [15:05:18] !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6003.drmrs.wmnet} and A:liberica [15:05:20] T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806 [15:05:37] !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6003.drmrs.wmnet} and A:liberica [15:06:03] !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6001.drmrs.wmnet} and A:liberica [15:06:23] !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6001.drmrs.wmnet} and A:liberica [15:06:30] FIRING: LibericaStaleConfig: Liberica instance lvs6003 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=drmrs&var-instance=lvs6003 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:06:40] lol [15:07:15] 
hey, at least the alerting works! [15:07:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P87551 and previous config saved to /var/cache/conftool/dbconfig/20260115-150735-marostegui.json [15:09:06] (03CR) 10Kevin Bazira: Add vLLM image in ML namespace (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs6003 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=drmrs&var-instance=lvs6003 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:11:33] (03PS1) 10DCausse: search: pull wme secrets out of the connections array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227351 [15:12:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [15:13:14] (03CR) 10CI reject: [V:04-1] search: pull wme secrets out of the connections array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227351 (owner: 10DCausse) [15:13:24] (03CR) 10Cwhite: [C:03+1] Remove profile::puppet::agent::force_puppet7 from search roles [puppet] - 10https://gerrit.wikimedia.org/r/1227270 (https://phabricator.wikimedia.org/T365798) 
(owner: 10Muehlenhoff) [15:13:59] (03CR) 10Cwhite: [C:03+1] Remove profile::puppet::agent::force_puppet7 from Data Platform roles [puppet] - 10https://gerrit.wikimedia.org/r/1227313 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:14:56] (03CR) 10Elukey: [C:03+1] "LGTM for a test" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [15:15:01] (03PS1) 10Ayounsi: Routed ganeti: move v6_prefixes to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) [15:16:53] PROBLEM - Host an-worker1159 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:06] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from search roles [puppet] - 10https://gerrit.wikimedia.org/r/1227270 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:17:40] (03PS2) 10Ayounsi: Routed ganeti: move v6_prefixes to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) [15:17:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P87552 and previous config saved to /var/cache/conftool/dbconfig/20260115-151744-marostegui.json [15:17:50] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) (owner: 10Ayounsi) [15:21:50] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from Data Platform roles [puppet] - 10https://gerrit.wikimedia.org/r/1227313 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:21:55] RECOVERY - Host an-worker1159 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:22:27] (03PS2) 10DCausse: search: pull wme secrets out of the connections array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227351 [15:23:01] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87553 and previous config saved to /var/cache/conftool/dbconfig/20260115-152301-marostegui.json [15:23:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:23:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:24:10] (03CR) 10CI reject: [V:04-1] search: pull wme secrets out of the connections array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227351 (owner: 10DCausse) [15:25:59] (03PS3) 10DCausse: search: pull wme secrets out of the connections array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227351 [15:27:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T413525)', diff saved to https://phabricator.wikimedia.org/P87554 and previous config saved to /var/cache/conftool/dbconfig/20260115-152752-marostegui.json [15:27:57] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:28:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1199.eqiad.wmnet with reason: Maintenance [15:28:16] (03PS1) 10CDanis: Liberica/gerrit: 🌍‼️ 🎊 [puppet] - 10https://gerrit.wikimedia.org/r/1227356 (https://phabricator.wikimedia.org/T411895) [15:28:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T413525)', diff saved to https://phabricator.wikimedia.org/P87555 and previous config saved to /var/cache/conftool/dbconfig/20260115-152817-marostegui.json [15:28:19] PROBLEM - Host an-worker1159 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227356 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:29:49] RECOVERY - SSH on an-worker1159 is 
OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:29:51] RECOVERY - Host an-worker1159 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1530) [15:32:51] (03CR) 10Vgutierrez: [C:03+1] Liberica/gerrit: 🌍‼️ 🎊 [puppet] - 10https://gerrit.wikimedia.org/r/1227356 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:33:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P87556 and previous config saved to /var/cache/conftool/dbconfig/20260115-153309-marostegui.json [15:33:49] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [15:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:31] (03PS4) 10Ayounsi: Routed ganeti: move v6_prefixes to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) [15:36:05] (03CR) 10CDanis: [C:03+2] Liberica/gerrit: 🌍‼️ 🎊 [puppet] - 10https://gerrit.wikimedia.org/r/1227356 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:36:24] (03CR) 10CDanis: [C:03+2] gerrit/Liberica: eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1227348 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:39:42] cmooney@cumin1003 netbox (PID 1669792) is awaiting input [15:41:50] !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4*} and A:liberica [15:42:30] FIRING: [6x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - 
https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:43:12] (03PS1) 10Sbisson: Fallback to source title if target title is not provided by cxserver [extensions/ContentTranslation] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1227361 (https://phabricator.wikimedia.org/T414558) [15:43:14] jouncebot nowandnext [15:43:14] For the next 0 hour(s) and 16 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1530) [15:43:14] In 1 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1700) [15:43:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P87557 and previous config saved to /var/cache/conftool/dbconfig/20260115-154317-marostegui.json [15:43:34] !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4*} and A:liberica [15:43:36] !log dancy@deploy2002 Installing scap version "4.232.0" for 2 host(s) [15:43:51] (03Abandoned) 10Sbisson: Fallback to source title if target title is not provided by cxserver [extensions/ContentTranslation] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1227361 (https://phabricator.wikimedia.org/T414558) (owner: 10Sbisson) [15:43:51] !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5*} and A:liberica [15:44:04] cdanis: poor high-traffic2 lvs reloading config for a NOOP ;P [15:44:34] I love all my liberica children [15:45:27] !log dancy@deploy2002 Installation of scap version "4.232.0" completed for 2 hosts [15:45:46] !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3*} and A:liberica [15:45:54] !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5*} and A:liberica [15:47:30] FIRING: [6x] LibericaStaleConfig: Liberica 
instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:47:41] !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3*} and A:liberica [15:47:53] some timing issue :) [15:48:16] (03CR) 10Bking: [C:03+2] search: pull wme secrets out of the connections array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227351 (owner: 10DCausse) [15:50:04] (03Merged) 10jenkins-bot: search: pull wme secrets out of the connections array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227351 (owner: 10DCausse) [15:51:10] (03PS1) 10CDanis: LVS/gerrit: eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1227363 (https://phabricator.wikimedia.org/T411895) [15:51:32] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227363 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:52:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T413525)', diff saved to https://phabricator.wikimedia.org/P87558 and previous config saved to /var/cache/conftool/dbconfig/20260115-155159-marostegui.json [15:52:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:52:20] (03PS1) 10Trueg: blazegraph: alert on ratio of failed queries increase [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) [15:52:30] RESOLVED: [6x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:53:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87560 and previous config saved to 
/var/cache/conftool/dbconfig/20260115-155326-marostegui.json [15:53:33] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:53:34] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:53:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1263.eqiad.wmnet with reason: Maintenance [15:53:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87561 and previous config saved to /var/cache/conftool/dbconfig/20260115-155351-marostegui.json [15:57:23] (03CR) 10CDanis: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1227363/5639/" [puppet] - 10https://gerrit.wikimedia.org/r/1227363 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [16:01:50] 06SRE, 06Release-Engineering-Team, 10Scap, 06serviceops, 07Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11525496 (10dancy) [16:02:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P87562 and previous config saved to /var/cache/conftool/dbconfig/20260115-160208-marostegui.json [16:03:20] 06SRE, 06Release-Engineering-Team, 10Scap, 06serviceops, 07Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11525507 (10dancy) @Blake I've installed a new release of scap on the deploy servers. You can now use `scap lock --a... 
[16:03:49] (03CR) 10Vgutierrez: [C:03+1] LVS/gerrit: eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1227363 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [16:04:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227346 (https://phabricator.wikimedia.org/T403982) (owner: 10Jsn.sherman) [16:06:15] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [16:07:37] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:42] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [16:09:11] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:31] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 10.77 ms [16:12:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:12:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P87563 and previous config saved to /var/cache/conftool/dbconfig/20260115-161216-marostegui.json [16:14:11] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:22:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T413525)', diff saved to https://phabricator.wikimedia.org/P87564 and previous config saved to /var/cache/conftool/dbconfig/20260115-162224-marostegui.json [16:22:29] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:22:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance [16:22:48] (03CR) 10Trueg: "To start the discussion: I think 1.0 is way too high as a threshold." [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: 10Trueg) [16:22:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T413525)', diff saved to https://phabricator.wikimedia.org/P87565 and previous config saved to /var/cache/conftool/dbconfig/20260115-162249-marostegui.json [16:24:19] (03CR) 10Majavah: [C:04-1] "-1 for the prometheus_nodes issue specifically, but in general I'm not a huge fan of this as it relies on the realm global and in general " [puppet] - 10https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:24:36] (03CR) 10Ssingh: [C:03+1] "Sorry, my bad." 
[puppet] - 10https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:24:41] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11525622 (10AKhatun_WMF) I also don't have access to `ssh an-launcher1003.eqiad.wmnet`. I get a permission denied. Is this related? Are we waiting for another approval (fro... [16:27:24] (03CR) 10Gmodena: "Nice!" [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: 10Trueg) [16:33:48] (03PS1) 10Vgutierrez: varnish: Drop leading commas when X-E-E is present on Vary [puppet] - 10https://gerrit.wikimedia.org/r/1227373 [16:34:24] (03CR) 10CI reject: [V:04-1] varnish: Drop leading commas when X-E-E is present on Vary [puppet] - 10https://gerrit.wikimedia.org/r/1227373 (owner: 10Vgutierrez) [16:35:15] (03PS2) 10Vgutierrez: varnish: Drop leading commas when X-E-E is present on Vary [puppet] - 10https://gerrit.wikimedia.org/r/1227373 [16:42:20] (03PS3) 10Vgutierrez: varnish: Drop leading commas when X-E-E is present on Vary [puppet] - 10https://gerrit.wikimedia.org/r/1227373 [16:43:06] (03CR) 10Vgutierrez: [V:03+1] "VTC is happy: # top TEST /wikimedia/varnish/text/55-vary-xee.vtc passed (3.024)" [puppet] - 10https://gerrit.wikimedia.org/r/1227373 (owner: 10Vgutierrez) [16:45:28] jouncebot: nowandnext [16:45:28] No deployments scheduled for the next 0 hour(s) and 14 minute(s) [16:45:28] In 0 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1700) [16:48:21] (03CR) 10Hnowlan: [C:03+2] thumbor: reimplement SVG max size feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226286 (https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan) [16:48:50] (03CR) 10Hnowlan: thumbor: reimplement SVG max size feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226286 
(https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan) [16:51:11] (03PS2) 10Hnowlan: thumbor: reimplement SVG max size feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226286 (https://phabricator.wikimedia.org/T411076) [16:52:00] (03CR) 10CDanis: [C:03+1] varnish: Drop leading commas when X-E-E is present on Vary [puppet] - 10https://gerrit.wikimedia.org/r/1227373 (owner: 10Vgutierrez) [16:52:12] (03CR) 10Trueg: "thresholds are indeed way too high." [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: 10Trueg) [16:54:05] (03CR) 10Scott French: [C:03+1] varnish: Drop leading commas when X-E-E is present on Vary [puppet] - 10https://gerrit.wikimedia.org/r/1227373 (owner: 10Vgutierrez) [16:54:14] (03CR) 10Vgutierrez: [V:03+1 C:03+2] varnish: Drop leading commas when X-E-E is present on Vary [puppet] - 10https://gerrit.wikimedia.org/r/1227373 (owner: 10Vgutierrez) [16:54:20] (03PS1) 10Bking: java: create openjdk-21 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) [16:55:42] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update reverse dns entries for arelion link ips - cmooney@cumin1003" [16:55:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update reverse dns entries for arelion link ips - cmooney@cumin1003" [16:55:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:02:10] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕛☕ sudo cumin 'A:lvs-eqiad' 'disable-puppet T411895' [17:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:14] T411895: gerrit behind CDN - https://phabricator.wikimedia.org/T411895 [17:02:52] (03CR) 10CDanis: [V:03+1 C:03+2] LVS/gerrit: eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1227363 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [17:03:40] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:10] lol [17:05:03] grmbl [17:06:51] wanted to silence/ACK it but already gone? [17:08:38] removing unit file and resetting state in a moment [17:09:09] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕛☕ sudo cumin A:lvs-secondary-eqiad 'systemctl restart pybal.service' [17:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:35] !log [cumin2002:~] $ sudo cumin -b 15 'tcp-proxy*' 'rm /lib/systemd/system/prometheus-node-textfile-check-nft*' [17:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:07] !log [cumin2002:~] $ sudo cumin -b 15 'tcp-proxy*' 'systemctl reset-failed' [17:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:25] RESOLVED: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:17:35] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕛☕ sudo cumin A:lvs-high-traffic1-eqiad 'systemctl restart pybal.service' [17:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:22:39] (03PS1) 10CDanis: LVS/gerrit: codfw [puppet] - 10https://gerrit.wikimedia.org/r/1227391 (https://phabricator.wikimedia.org/T411895) [17:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:33:20] 10ops-codfw, 06DC-Ops: wikikube-worker2346 DOA - https://phabricator.wikimedia.org/T414708 (10Jhancock.wm) 03NEW [17:34:19] (03CR) 10CDanis: [C:03+2] LVS/gerrit: codfw [puppet] - 10https://gerrit.wikimedia.org/r/1227391 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [17:34:25] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy4001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:38] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:lvs-codfw' 'disable-puppet T411895' [17:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:43] T411895: gerrit behind CDN - https://phabricator.wikimedia.org/T411895 [17:36:46] 10ops-codfw, 06DC-Ops: wikikube-worker2346 DOA - https://phabricator.wikimedia.org/T414708#11525824 (10Jhancock.wm) [17:37:16] 10ops-codfw, 06DC-Ops: wikikube-worker2346 DOA - https://phabricator.wikimedia.org/T414708#11525825 (10Jhancock.wm) [17:37:43] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin A:lvs-secondary-codfw 'systemctl restart pybal.service' [17:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:41:53] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin A:lvs-high-traffic1-codfw 'systemctl restart pybal.service' [17:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:59] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:lvs-codfw or A:lvs-eqiad' 'enable-puppet T411895' [17:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:04] T411895: gerrit behind CDN - https://phabricator.wikimedia.org/T411895 [17:56:13] (03PS1) 10Milimetric: eventgate-analytics: increase instances to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227392 (https://phabricator.wikimedia.org/T411454) [18:00:05] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1800) [18:00:23] (03PS1) 10CDanis: tunnelencabulator: Gerrit/CDN 🚀 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227395 (https://phabricator.wikimedia.org/T411895) [18:05:56] (03PS2) 10Seawolf35gerrit: ukwiki: Add "changetags" to sysop user group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227394 [18:06:00] (03CR) 10Ssingh: [C:03+1] "Strictly basing it on the additions to the existing code and modification for gerrit-cdn. 
I have not tested it so leave it to you :)" [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227395 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [18:06:25] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:06:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227394 (owner: 10Seawolf35gerrit) [18:07:13] Nothing to deploy in my window today [18:07:46] (03PS3) 10Seawolf35gerrit: ukwiki: Add "changetags" to sysop user group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227394 (https://phabricator.wikimedia.org/T414277) [18:14:01] (03CR) 10Btullis: java: create openjdk-21 image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: 10Bking) [18:23:11] (03CR) 10Bking: java: create openjdk-21 image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: 10Bking) [18:35:38] (03PS1) 10Bking: opensearch-ipoid: Add codfw to list of sites [puppet] - 10https://gerrit.wikimedia.org/r/1227406 (https://phabricator.wikimedia.org/T412447) [18:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:41:48] (03CR) 10Bking: [C:03+2] 
opensearch-ipoid: Add codfw to list of sites [puppet] - 10https://gerrit.wikimedia.org/r/1227406 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [18:44:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:45:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1226922 (https://phabricator.wikimedia.org/T414619) (owner: 10Dduvall) [18:45:50] (03CR) 10Muehlenhoff: [C:03+2] admin: Add new yubikey-ssh-fido keys for dduvall [puppet] - 10https://gerrit.wikimedia.org/r/1226922 (https://phabricator.wikimedia.org/T414619) (owner: 10Dduvall) [18:46:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:47:33] (03PS2) 10CDanis: tunnelencabulator: Gerrit/CDN 🚀 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227395 (https://phabricator.wikimedia.org/T411895) [18:49:16] (03CR) 10Ssingh: [C:03+1] "Yes, fair enough :) [PS2-PS1 diff]" [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227395 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [18:49:48] (03PS3) 10CDanis: tunnelencabulator: Gerrit/CDN 🚀 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227395 (https://phabricator.wikimedia.org/T411895) [18:50:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:53:44] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new wikikube-worker nodes - pt1979@cumin2002" [18:53:44] (03CR) 10Ssingh: [C:03+1] tunnelencabulator: Gerrit/CDN 🚀 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227395 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [18:53:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new wikikube-worker nodes - pt1979@cumin2002" [18:53:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[18:55:48] (03CR) 10CDanis: [V:03+2 C:03+2] tunnelencabulator: Gerrit/CDN 🚀 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227395 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [18:58:52] (03CR) 10Ssingh: "@vgutierrez@wikimedia.org: We discussed this during the meeting and decided it was fine to merge. Can you stamp this please?" [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [18:59:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11526053 (10thcipriani) >>! In T413364#11521115, @JMeybohm wrote: > @thcipriani this needs sign-off from you as the approver for the... [19:00:05] jeena and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T1900). [19:01:51] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11526060 (10thcipriani) >>! In T414347#11512705, @BTullis wrote: > We will need approval from @Ahoelzl as your manager and from @thcipriani as the approver for the `deployme... 
[19:04:22] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227410 (https://phabricator.wikimedia.org/T413802) [19:04:25] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227410 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [19:05:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717 (10RobH) 03NEW [19:05:21] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227410 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [19:05:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718 (10RobH) 03NEW [19:05:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11526094 (10RobH) [19:08:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11526107 (10RobH) [19:09:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11526108 (10RobH) a:03jcrespo Jaime, I made assumptions on the racking details based on the existing ms-backup hosts. Please double-check the racking details in this task... [19:09:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11526112 (10RobH) a:03jcrespo Jaime, I made assumptions on the racking details based on the existing ms-backup hosts. Please double-check the racking details in this task... 
[19:09:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11526116 (10RobH) [19:09:37] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11526117 (10jcrespo) Will do. [19:09:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11526118 (10RobH) [19:11:27] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.11 refs T413802 [19:11:32] T413802: 1.46.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T413802 [19:27:59] (03PS3) 10A smart kitten: ukwiki: Move assignments of FlaggedRevs permissions to flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227385 (https://phabricator.wikimedia.org/T414277) [19:28:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227385 (https://phabricator.wikimedia.org/T414277) (owner: 10A smart kitten) [19:29:12] (03CR) 10A smart kitten: "Did some testing locally, this approach seems like it should (hopefully) work :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227385 (https://phabricator.wikimedia.org/T414277) (owner: 10A smart kitten) [19:30:08] (03CR) 10A smart kitten: ukwiki: Various changes to user rights. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [19:30:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [19:30:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87566 and previous config saved to /var/cache/conftool/dbconfig/20260115-193040-marostegui.json [19:30:47] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:30:47] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:31:04] 06SRE, 10DNS, 06serviceops, 06Traffic, and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11526184 (10ssingh) This is typically done as part of a new wiki creation process, but Traffic is happy to help as required. [19:43:36] (03CR) 10Seawolf35gerrit: ukwiki: Various changes to user rights. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [19:48:52] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724 (10RobH) 03NEW [19:49:20] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11526283 (10RobH) [19:49:31] !log jasmine@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [19:50:21] !log jasmine@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [19:51:10] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11526289 (10RobH) a:03jcrespo Jaime, I had to split up the expansion and refresh budget lines for backup this quarter, so this racking task (and its parent order task) on... 
[19:51:14] (03CR) 10Jasmine: [C:03+2] wikikube: decommission worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048] [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) (owner: 10Jasmine) [19:51:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T413525)', diff saved to https://phabricator.wikimedia.org/P87567 and previous config saved to /var/cache/conftool/dbconfig/20260115-195153-marostegui.json [19:51:59] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:52:30] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11526301 (10RobH) [19:52:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T414725 (10RobH) 03NEW [19:53:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T414725#11526321 (10RobH) a:03jcrespo Jaime, I had to split up the expansion and refresh budget lines for backup this quarter, so this racking task (and its parent order task) only covers the li... [19:53:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T414725#11526331 (10RobH) [19:54:01] !log “homer run T409102” [19:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:05] T409102: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet - https://phabricator.wikimedia.org/T409102 [19:56:15] 06SRE, 10DNS, 06serviceops, 06Traffic, and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11526339 (10Jdforrester-WMF) >>! In T411724#11526184, @ssingh wrote: > This is typically done as part of a new wiki creation process, but Traffic is happy... 
[20:02:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P87568 and previous config saved to /var/cache/conftool/dbconfig/20260115-200202-marostegui.json [20:02:52] (03PS1) 10CDanis: services: gerrit* --> monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/1227423 (https://phabricator.wikimedia.org/T411895) [20:03:05] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227423 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [20:05:17] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet - https://phabricator.wikimedia.org/T409102#11526351 (10jasmine_) [20:07:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T413525)', diff saved to https://phabricator.wikimedia.org/P87569 and previous config saved to /var/cache/conftool/dbconfig/20260115-200721-marostegui.json [20:07:26] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:12:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P87570 and previous config saved to /var/cache/conftool/dbconfig/20260115-201210-marostegui.json [20:12:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:14:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:17:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P87571 and previous config saved to /var/cache/conftool/dbconfig/20260115-201730-marostegui.json [20:19:05] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:55] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:57] (03PS4) 10A smart kitten: ukwiki: Move assignments of FlaggedRevs permissions to flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227385 (https://phabricator.wikimedia.org/T414277) [20:20:56] (03CR) 10A smart kitten: "PS4 is a rebase on top of https://gerrit.wikimedia.org/r/1227394, after I realised the two patches would probably have merge conflicts wit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227385 (https://phabricator.wikimedia.org/T414277) (owner: 10A smart kitten) [20:22:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T413525)', diff saved to https://phabricator.wikimedia.org/P87572 and previous config saved to /var/cache/conftool/dbconfig/20260115-202218-marostegui.json [20:22:23] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:22:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1221.eqiad.wmnet with reason: Maintenance [20:22:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [20:23:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T413525)', diff saved to https://phabricator.wikimedia.org/P87573 and previous config saved to /var/cache/conftool/dbconfig/20260115-202305-marostegui.json [20:27:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P87574 and previous config saved to /var/cache/conftool/dbconfig/20260115-202738-marostegui.json [20:31:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727 (10RobH) 03NEW [20:31:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11526457 (10RobH) [20:33:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728 (10RobH) 03NEW [20:33:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11526474 (10RobH) [20:35:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11526489 (10RobH) [20:36:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11526492 (10RobH) a:03jcrespo Jaime, I had to split up the expansion and refresh budget lines for backup this quarter, so this racking task (and its parent order task) only... 
[20:36:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11526498 (10RobH) a:03jcrespo Jaime, I had to split up the expansion and refresh budget lines for backup this quarter, so this racking task (and its parent order task) only... [20:37:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T413525)', diff saved to https://phabricator.wikimedia.org/P87575 and previous config saved to /var/cache/conftool/dbconfig/20260115-203746-marostegui.json [20:37:51] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:37:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance [20:38:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T413525)', diff saved to https://phabricator.wikimedia.org/P87576 and previous config saved to /var/cache/conftool/dbconfig/20260115-203759-marostegui.json [20:39:07] (03PS1) 10Jasmine: wikikube: decommission wikikube-worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) [20:39:38] (03CR) 10CI reject: [V:04-1] wikikube: decommission wikikube-worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) (owner: 10Jasmine) [20:41:10] (03PS2) 10Jasmine: wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) [20:44:47] (03CR) 10Andrew Bogott: [C:03+2] Revert "wmcs cinder backups: move all backups to 2003 so 2004 can be reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1226952 (owner: 10Andrew Bogott) [20:59:10] (03PS1) 10Clare Ming: Update experiment code 
for JS, PHP SDKs testing of TK [extensions/TestKitchen] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227435 (https://phabricator.wikimedia.org/T414528) [20:59:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/TestKitchen] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227435 (https://phabricator.wikimedia.org/T414528) (owner: 10Clare Ming) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T2100). [21:00:05] xSavitar, katherine_g, Seawolf35, A_smart_kitten, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] heya, i'm here :) [21:00:17] o/ [21:00:20] o/ [21:00:29] o/ [21:01:05] I can self-service my backports then deployers/others can carry one 🙏🏽 [21:01:21] *on [21:01:23] I can help with backporting if needed [21:01:29] I will need a deployer [21:01:36] Me as well [21:01:39] jeena, I'll poke you once I'm done. [21:01:58] xSavitar: 👍 Thank you! 
[21:03:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227281 (owner: 10D3r1ck01) [21:03:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227282 (https://phabricator.wikimedia.org/T413947) (owner: 10D3r1ck01) [21:04:01] also happy to deploy if needed - will self-service when it's my turn [21:05:25] (03CR) 10Gmodena: blazegraph: alert on ratio of failed queries increase (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: 10Trueg) [21:06:18] (03Merged) 10jenkins-bot: Control: When saving grants, ensure array has no gaps [extensions/OAuth] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227281 (owner: 10D3r1ck01) [21:06:18] (03Merged) 10jenkins-bot: Control: Keep irrevocable grants when accepting new OAuth 2 consumers [extensions/OAuth] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227282 (https://phabricator.wikimedia.org/T413947) (owner: 10D3r1ck01) [21:06:39] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1227281|Control: When saving grants, ensure array has no gaps]], [[gerrit:1227282|Control: Keep irrevocable grants when accepting new OAuth 2 consumers (T413947)]] [21:06:43] T413947: Updating grants (via Special:OAuthManageMyGrants) of OAuth accepted consumers overrides its grants with empty array - https://phabricator.wikimedia.org/T413947 [21:08:36] !log derick@deploy2002 derick, d3r1ck01: Backport for [[gerrit:1227281|Control: When saving grants, ensure array has no gaps]], [[gerrit:1227282|Control: Keep irrevocable grants when accepting new OAuth 2 consumers (T413947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:09:07] * xSavitar testing... 
[21:09:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:11:58] Things look good and issues seem to have been resolved. Syncing [21:12:09] !log derick@deploy2002 derick, d3r1ck01: Continuing with sync [21:14:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:16:20] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227281|Control: When saving grants, ensure array has no gaps]], [[gerrit:1227282|Control: Keep irrevocable grants when accepting new OAuth 2 consumers (T413947)]] (duration: 09m 41s) [21:16:24] T413947: Updating grants (via Special:OAuthManageMyGrants) of OAuth accepted consumers overrides its grants with empty array - https://phabricator.wikimedia.org/T413947 [21:16:27] I’ll be afk for 5 min or so, I’ll be back in time for my patch. [21:16:47] respectfully yours, jeena / cjming, over to you 🙏🏽 [21:17:01] I'm done! [21:17:57] okay well I was going to see if I could do A_smart_kitten and Seawolf35 's one together [21:18:21] but do you want to go ahead first cjming ? [21:18:22] jeena: that's actually what i was thinking myself as well (so long as Seawolf35 is okay with it) [21:18:30] * cjming bows to jeena [21:18:32] yeah they just left the channel [21:18:41] yeah, as they're AFK right now maybe cjming or katherine_g could go first? 
[21:18:50] jeena: go ahead - i have to fiddle with something first [21:19:01] okay I'll do yours katherine_g [21:19:07] ok, ty [21:19:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227346 (https://phabricator.wikimedia.org/T403982) (owner: 10Jsn.sherman) [21:20:40] (03Merged) 10jenkins-bot: Deploy PersonalDashboard to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227346 (https://phabricator.wikimedia.org/T403982) (owner: 10Jsn.sherman) [21:20:59] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1227346|Deploy PersonalDashboard to testwiki (T403982)]] [21:21:04] T403982: Create and deploy Extension:PersonalDashboard - https://phabricator.wikimedia.org/T403982 [21:21:57] (03Abandoned) 10Arlolra: Support incremental roll out of Parsoid Read Views [extensions/ParserMigration] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1224837 (https://phabricator.wikimedia.org/T391881) (owner: 10Arlolra) [21:22:59] !log jhuneidi@deploy2002 jhuneidi, jsn: Backport for [[gerrit:1227346|Deploy PersonalDashboard to testwiki (T403982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:23:06] Back [21:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:25:45] katherine_g: do you need to check anything on the testservers? [21:25:50] alright, tested and it looks good on my end [21:25:57] jeena: ty [21:25:59] cool thanks! 
[21:26:06] !log jhuneidi@deploy2002 jhuneidi, jsn: Continuing with sync [21:26:09] (03CR) 10Ssingh: services: gerrit* --> monitoring_setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227423 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [21:27:26] Seawolf35: we were wondering if your change can be deployed with A_smart_kitten 's together? [21:27:36] Sure [21:27:45] :) [21:30:17] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227346|Deploy PersonalDashboard to testwiki (T403982)]] (duration: 09m 18s) [21:30:21] T403982: Create and deploy Extension:PersonalDashboard - https://phabricator.wikimedia.org/T403982 [21:31:14] cjming: I'm going to go ahead and do the remaining two now [21:31:28] jeena: great - thanks! [21:32:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227385 (https://phabricator.wikimedia.org/T414277) (owner: 10A smart kitten) [21:32:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227394 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [21:33:33] (03Merged) 10jenkins-bot: ukwiki: Add "changetags" to sysop user group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227394 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [21:33:35] (03Merged) 10jenkins-bot: ukwiki: Move assignments of FlaggedRevs permissions to flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227385 (https://phabricator.wikimedia.org/T414277) (owner: 10A smart kitten) [21:33:53] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1227385|ukwiki: Move assignments of FlaggedRevs permissions to flaggedrevs.php (T414277 T414684)]], [[gerrit:1227394|ukwiki: Add "changetags" to sysop user group. 
(T414277)]] [21:33:59] T414277: Some changes in user group rights in ukwiki - https://phabricator.wikimedia.org/T414277 [21:33:59] T414684: FlaggedRevs-specific group rights from core-Permissions.php get overridden by flaggedrevs.php - https://phabricator.wikimedia.org/T414684 [21:35:50] !log jhuneidi@deploy2002 asmartkitten, seawolf35gerrit, jhuneidi: Backport for [[gerrit:1227385|ukwiki: Move assignments of FlaggedRevs permissions to flaggedrevs.php (T414277 T414684)]], [[gerrit:1227394|ukwiki: Add "changetags" to sysop user group. (T414277)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:36:09] looking (cc Seawolf35) [21:36:12] Testing [21:37:06] my patch looks good AFAICS :] [21:37:20] Mine lgtm [21:37:28] thanks! [21:37:36] !log jhuneidi@deploy2002 asmartkitten, seawolf35gerrit, jhuneidi: Continuing with sync [21:39:40] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:39] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227385|ukwiki: Move assignments of FlaggedRevs permissions to flaggedrevs.php (T414277 T414684)]], [[gerrit:1227394|ukwiki: Add "changetags" to sysop user group. (T414277)]] (duration: 07m 46s) [21:41:45] T414277: Some changes in user group rights in ukwiki - https://phabricator.wikimedia.org/T414277 [21:41:45] T414684: FlaggedRevs-specific group rights from core-Permissions.php get overridden by flaggedrevs.php - https://phabricator.wikimedia.org/T414684 [21:42:01] jeena: thank you for deploying! [21:42:04] cjming: ready for you [21:42:10] A_smart_kitten: yw! [21:42:10] tysm! [21:42:20] jeena ty! [21:42:27] yw! 
[21:43:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227435 (https://phabricator.wikimedia.org/T414528) (owner: 10Clare Ming) [21:43:37] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:55] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 33%, RTA = 1653.67 ms [21:44:11] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:45:08] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:47:17] 06SRE, 10Release Pipeline, 06serviceops, 06Release-Engineering-Team (Seen): Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041#11526839 (10Eevans) [21:49:11] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:52:45] (03Merged) 10jenkins-bot: Update experiment code for JS, PHP SDKs testing of TK [extensions/TestKitchen] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227435 (https://phabricator.wikimedia.org/T414528) (owner: 10Clare Ming) [21:53:06] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1227435|Update experiment code for JS, PHP 
SDKs testing of TK (T414528 T414530)]] [21:53:12] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team (Radar), 07Sustainability (Incident Followup): Provision anonymous session storage - https://phabricator.wikimedia.org/T408935#11526882 (10Eevans) [21:53:13] T414528: Run synthetic experiment using Javascript SDK in Test Kitchen - https://phabricator.wikimedia.org/T414528 [21:53:13] T414530: Run synthetic experiment using PHP SDK in Test Kitchen - https://phabricator.wikimedia.org/T414530 [21:55:07] !log cjming@deploy2002 cjming: Backport for [[gerrit:1227435|Update experiment code for JS, PHP SDKs testing of TK (T414528 T414530)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:56:07] checking [21:58:18] syncing [21:58:30] !log cjming@deploy2002 cjming: Continuing with sync [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T2200) [22:02:29] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227435|Update experiment code for JS, PHP SDKs testing of TK (T414528 T414530)]] (duration: 09m 23s) [22:02:34] T414528: Run synthetic experiment using Javascript SDK in Test Kitchen - https://phabricator.wikimedia.org/T414528 [22:02:35] T414530: Run synthetic experiment using PHP SDK in Test Kitchen - https://phabricator.wikimedia.org/T414530 [22:03:47] (03CR) 10CI reject: [V:04-1] logos: Add WP25 temporary logo for Hausa Wikipedia (hawiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227443 (https://phabricator.wikimedia.org/T414736) (owner: 10SarthakSingh2904) [22:06:40] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:23:45] (03PS2) 10SarthakSingh2904: logos: Add WP25 temporary logo 
for Hausa Wikipedia (hawiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/1227443 (https://phabricator.wikimedia.org/T414736)
[22:24:47] (CR) CI reject: [V:-1] logos: Add WP25 temporary logo for Hausa Wikipedia (hawiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/1227443 (https://phabricator.wikimedia.org/T414736) (owner: SarthakSingh2904)
[22:25:10] (CR) Ryan Kemper: [C:+1] java: create openjdk-21 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: Bking)
[22:25:32] (CR) Bking: [C:+2] java: create openjdk-21 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: Bking)
[22:25:43] (CR) Bking: [V:+2 C:+2] java: create openjdk-21 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: Bking)
[22:32:14] (Abandoned) SarthakSingh2904: logos: Add WP25 temporary logo for Hausa Wikipedia (hawiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/1227443 (https://phabricator.wikimedia.org/T414736) (owner: SarthakSingh2904)
[22:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:59:07] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:02:37] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new wikikube-worker nodes - pt1979@cumin2002"
[23:02:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0)
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new wikikube-worker nodes - pt1979@cumin2002"
[23:02:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:04:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:27] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11527136 (Papaul)
[23:09:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:13:13] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11527155 (Papaul)
[23:13:46] (PS1) Jasmine: wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104)
[23:52:01] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:52:23] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:54:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:54:39] FIRING:
[4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown