[00:06:25] RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T413525)', diff saved to https://phabricator.wikimedia.org/P87578 and previous config saved to /var/cache/conftool/dbconfig/20260116-001027-marostegui.json
[00:10:33] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[00:14:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:14:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T413525)', diff saved to https://phabricator.wikimedia.org/P87579 and previous config saved to /var/cache/conftool/dbconfig/20260116-001449-marostegui.json
[00:20:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P87580 and previous config saved to /var/cache/conftool/dbconfig/20260116-002036-marostegui.json
[00:24:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P87581 and previous config saved to /var/cache/conftool/dbconfig/20260116-002457-marostegui.json
[00:30:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P87582 and previous config saved to /var/cache/conftool/dbconfig/20260116-003044-marostegui.json
[00:35:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P87583 and previous config saved to /var/cache/conftool/dbconfig/20260116-003506-marostegui.json
[00:40:45] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227484
[00:40:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227484 (owner: TrainBranchBot)
[00:40:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T413525)', diff saved to https://phabricator.wikimedia.org/P87584 and previous config saved to /var/cache/conftool/dbconfig/20260116-004052-marostegui.json
[00:40:58] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[00:41:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[00:41:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87585 and previous config saved to /var/cache/conftool/dbconfig/20260116-004117-marostegui.json
[00:45:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T413525)', diff saved to https://phabricator.wikimedia.org/P87586 and previous config saved to /var/cache/conftool/dbconfig/20260116-004514-marostegui.json
[00:45:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
[00:45:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T413525)', diff saved to https://phabricator.wikimedia.org/P87587 and previous config saved to /var/cache/conftool/dbconfig/20260116-004540-marostegui.json
[00:52:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227484 (owner: TrainBranchBot)
[01:00:58] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227491
[01:10:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227491 (owner: TrainBranchBot)
[01:14:03] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 05s)
[01:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:27:05] (PS1) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711)
[01:27:58] (CR) CI reject: [V:-1] enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[01:31:01] (PS2) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711)
[01:31:51] (CR) CI reject: [V:-1] enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[01:32:53] (PS3) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711)
[01:33:26] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227491 (owner: TrainBranchBot)
[01:33:42] (CR) CI reject: [V:-1] enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[01:36:10] (PS4) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711)
[01:36:35] (CR) Seawolf35gerrit: "Maybe I've fixed all my bad typing now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[01:39:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[01:41:04] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[02:07:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87588 and previous config saved to /var/cache/conftool/dbconfig/20260116-020740-marostegui.json
[02:07:47] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:07:47] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:17:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P87589 and previous config saved to /var/cache/conftool/dbconfig/20260116-021748-marostegui.json
[02:27:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P87590 and previous config saved to /var/cache/conftool/dbconfig/20260116-022758-marostegui.json
[02:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:38:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87591 and previous config saved to /var/cache/conftool/dbconfig/20260116-023806-marostegui.json
[02:38:13] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:38:13] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:38:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[03:09:40] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:54:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:00:16] (CR) Codename Noreste: [C:-1] "It looks like you forgot to add the user right (editautopatrolprotected) to the bot user group." [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[04:13:20] (CR) Seawolf35gerrit: "@codenamenoreste@gmail.com Bots have this right by default. For example, https://phabricator.wikimedia.org/T357298 and https://gerrit.wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[04:14:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:17:04] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:17:26] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:19:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:19:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:30:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87592 and previous config saved to /var/cache/conftool/dbconfig/20260116-043012-marostegui.json
[04:30:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[04:35:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T413525)', diff saved to https://phabricator.wikimedia.org/P87593 and previous config saved to /var/cache/conftool/dbconfig/20260116-043511-marostegui.json
[04:40:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P87594 and previous config saved to /var/cache/conftool/dbconfig/20260116-044020-marostegui.json
[04:45:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P87595 and previous config saved to /var/cache/conftool/dbconfig/20260116-044519-marostegui.json
[04:50:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P87596 and previous config saved to /var/cache/conftool/dbconfig/20260116-045028-marostegui.json
[04:55:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P87597 and previous config saved to /var/cache/conftool/dbconfig/20260116-045527-marostegui.json
[05:00:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87598 and previous config saved to /var/cache/conftool/dbconfig/20260116-050038-marostegui.json
[05:00:44] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[05:00:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[05:01:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T413525)', diff saved to https://phabricator.wikimedia.org/P87599 and previous config saved to /var/cache/conftool/dbconfig/20260116-050102-marostegui.json
[05:05:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T413525)', diff saved to https://phabricator.wikimedia.org/P87600 and previous config saved to /var/cache/conftool/dbconfig/20260116-050536-marostegui.json
[05:05:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance
[05:09:11] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:34:11] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:48:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:48:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87601 and previous config saved to /var/cache/conftool/dbconfig/20260116-054831-marostegui.json
[05:48:39] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:48:39] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:16:10] !log marostegui@cumin1003 START - Cookbook sre.wikireplicas.update-views
[06:21:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[06:21:13] !log marostegui@cumin1003 START - Cookbook sre.wikireplicas.update-views
[06:26:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[06:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:56:46] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527720 (Kris_Litson_WMDE) I also bless this request as the lead of @Johannes_Richter_WMDE
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260116T0700)
[07:09:40] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:08] (CR) Muehlenhoff: "openjdk/jdk21 for Bookworm was just a one off to move CAS to Java 21 (when it started depending on it), we're currently not keeping it act" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: Bking)
[07:24:09] (PS3) Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from traffic hosts [puppet] - https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798)
[07:26:50] (CR) Muehlenhoff: [C:+2] Remove profile::puppet::agent::force_puppet7 from traffic hosts [puppet] - https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:31:27] (PS1) Muehlenhoff: Remove remaining traces of profile::puppet::agent::force_puppet7 [puppet] - https://gerrit.wikimedia.org/r/1227616 (https://phabricator.wikimedia.org/T365798)
[07:43:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[07:43:12] SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11527767 (Dzahn) please see T414678#11524961
[07:49:00] (PS1) Muehlenhoff: Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798)
[07:50:33] (CR) Muehlenhoff: "Maintaining component/jdk21 for Bookworm is also an option, if e.g. OpenSearch isn't yet compatible with Trixie otherwise." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: Bking)
[07:50:50] (PS2) Muehlenhoff: Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798)
[07:52:48] (CR) CI reject: [V:-1] Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:53:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[07:54:38] !log phabricator - addign Johannes_Richter_WMDE to WMF-NDA T414678
[07:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:44] T414678: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678
[07:55:12] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527773 (Dzahn) @Johannes_Richter_WMDE Yes, that is still common practice. It must have been overlooked back then in that other task (left a comment there). I co...
[07:56:43] (CR) Dpogorzelski: [C:+2] Add vLLM image in ML namespace [docker-images/production-images] - https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: Kevin Bazira)
[07:56:53] (CR) Dpogorzelski: [V:+2 C:+2] Add vLLM image in ML namespace [docker-images/production-images] - https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: Kevin Bazira)
[07:58:11] !log phabricator - adding Martyn.ranyard to WMF-NDA (T413994)
[07:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:16] T413994: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994
[07:58:51] !log phabricator - adding kimpham to WMF-NDA (T414157)
[07:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:56] T414157: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260116T0800)
[08:00:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527777 (Dzahn) Also added recently NDAed WMDE users @kimpham and @Martyn.ranyard Sorry about missing this at first.
[08:02:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[08:03:29] (PS3) Muehlenhoff: Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798)
[08:04:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:57] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:09:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:18] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) (owner: Ayounsi)
[08:23:04] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11527822 (elukey) @MatthewVernon Hi! I tried to manually delete some tests from the registry's bucket, both from eqiad (via s3cmd) and codfw (via the registry's G...
[08:26:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:12] (PS1) Muehlenhoff: Move validatecloudvpsfqdn.py out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1227694 (https://phabricator.wikimedia.org/T365798)
[08:31:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: Maintenance
[08:32:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[08:32:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87602 and previous config saved to /var/cache/conftool/dbconfig/20260116-083206-marostegui.json
[08:32:12] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:32:58] SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T414775 (WMDE-leszek) NEW
[08:32:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227694 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:34:15] (PS1) Elukey: ml: fix vllm's image builder config [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227697 (https://phabricator.wikimedia.org/T385173)
[08:35:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[08:36:02] (CR) JMeybohm: [C:+1] "nice, thanks" [puppet] - https://gerrit.wikimedia.org/r/978615 (owner: Muehlenhoff)
[08:37:04] (CR) Kevin Bazira: [C:+1] ml: fix vllm's image builder config [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227697 (https://phabricator.wikimedia.org/T385173) (owner: Elukey)
[08:39:34] (CR) Kevin Bazira: Add vLLM image in ML namespace (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: Kevin Bazira)
[08:43:23] (CR) Filippo Giunchedi: [C:+1] Move validatecloudvpsfqdn.py out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1227694 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:44:05] (CR) Filippo Giunchedi: [C:+1] Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:45:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T413525)', diff saved to https://phabricator.wikimedia.org/P87603 and previous config saved to /var/cache/conftool/dbconfig/20260116-084557-marostegui.json
[08:46:03] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:46:17] (PS1) Muehlenhoff: Remove puppetmaster spec files [puppet] - https://gerrit.wikimedia.org/r/1227698 (https://phabricator.wikimedia.org/T365798)
[08:49:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527874 (Johannes_Richter_WMDE) Thanks!
[08:50:01] !log depool titan2001, cleaning up block 01K88XDMJ9S0T2DR5K00VG9CFE (T410152)
[08:50:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[08:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:07] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[08:53:02] (CR) Elukey: [C:+1] Remove remaining traces of profile::puppet::agent::force_puppet7 [puppet] - https://gerrit.wikimedia.org/r/1227616 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:55:42] (PS1) Muehlenhoff: Copy yamllint into the puppetserver module and use it [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798)
[08:56:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P87604 and previous config saved to /var/cache/conftool/dbconfig/20260116-085605-marostegui.json
[08:58:01] (CR) CI reject: [V:-1] Copy yamllint into the puppetserver module and use it [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:58:26] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11527884 (MatthewVernon) ` mvernon@moss-be1001:~$ sudo cephadm shell -- radosgw-admin bucket sync status --bucket=registry-restricted Inferring fsid 3f38ada2-2d88...
[09:01:35] (PS1) Elukey: docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476)
[09:02:06] (CR) CI reject: [V:-1] docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[09:02:20] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[09:03:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87605 and previous config saved to /var/cache/conftool/dbconfig/20260116-090353-marostegui.json
[09:04:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[09:04:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[09:04:06] SRE, DNS, serviceops, Traffic, and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527892 (Dzahn) Yea, it is. Languages would typically be added to `dns/templates/helpers/langlist.tmpl` but it feels like adding a non-language to the "...
[09:05:43] (PS1) Dzahn: add abstract.wikipedia.org to section for wikis not covered by langlist [dns] - https://gerrit.wikimedia.org/r/1227706 (https://phabricator.wikimedia.org/T411724)
[09:06:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P87606 and previous config saved to /var/cache/conftool/dbconfig/20260116-090614-marostegui.json
[09:07:08] (PS2) Muehlenhoff: Copy yamllint into the puppetserver module and use it [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798)
[09:07:21] SRE, DNS, serviceops, Traffic, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527902 (Dzahn) I would think it belongs into the section for `Wikis with mobile site (alphabetic order), which are not covered by langlist.tmpl`. htt...
[09:08:22] SRE, DNS, serviceops, Traffic, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527904 (Dzahn) Note that there is also a section for ` Wikis without mobile site (alphabetic order), which are not covered by langlist.tmpl` right belo...
[09:09:21] (PS2) Elukey: docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) [09:10:01] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [09:13:17] (PS2) Dzahn: add abstract.wikipedia.org to section for wikis not covered by langlist [dns] - https://gerrit.wikimedia.org/r/1227706 (https://phabricator.wikimedia.org/T411724) [09:14:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P87607 and previous config saved to /var/cache/conftool/dbconfig/20260116-091401-marostegui.json [09:15:12] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [09:15:32] !log attempting soft reboot of instance codesearch9.codesearch - down and can't connect - T414776 [09:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:37] T414776: Codesearch is down/unreachable (2026-01-16) - https://phabricator.wikimedia.org/T414776 [09:16:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T413525)', diff saved to https://phabricator.wikimedia.org/P87608 and previous config saved to /var/cache/conftool/dbconfig/20260116-091623-marostegui.json [09:16:31] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:16:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1242.eqiad.wmnet with reason: Maintenance [09:16:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T413525)', diff saved to https://phabricator.wikimedia.org/P87609 and previous config saved to 
/var/cache/conftool/dbconfig/20260116-091649-marostegui.json [09:22:33] SRE, LDAP-Access-Requests, Phabricator: undisable vanderwaalforces in phabricator and ldap - https://phabricator.wikimedia.org/T414774#11527922 (taavi) I have already confirmed this with T&S based on a private email request. I still need to double-check how to invalidate the existing password to forc... [09:24:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P87611 and previous config saved to /var/cache/conftool/dbconfig/20260116-092410-marostegui.json [09:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:25:10] SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T414775#11527926 (Dzahn) Done! Confirmed both are in our NDA spreadsheet and the Phab WMF-NDA group. Added them to the Wikitech page. [09:25:44] SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T414775#11527927 (Dzahn) Open→Resolved a:Dzahn [09:29:00] SRE-swift-storage, Ceph, Data-Persistence, serviceops, Patch-For-Review: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11527941 (elukey) @MatthewVernon ah nice I used the wrong bucket name when checking the config, I still don't explain that error on apus-fe... 
[09:29:43] SRE, SRE-Access-Requests, Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11527944 (Dzahn) a:thcipriani→None [09:30:04] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11527946 (Dzahn) a:thcipriani→None [09:31:30] PROBLEM - SSH on stat1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:32:20] RECOVERY - SSH on stat1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:32:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87612 and previous config saved to /var/cache/conftool/dbconfig/20260116-093232-marostegui.json [09:32:39] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:34:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87613 and previous config saved to /var/cache/conftool/dbconfig/20260116-093418-marostegui.json [09:34:27] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:34:28] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:34:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2146.codfw.wmnet 
with reason: Maintenance [09:34:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87614 and previous config saved to /var/cache/conftool/dbconfig/20260116-093444-marostegui.json [09:35:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:13] (PS1) Dzahn: admin: add Aisha Khatun to deployers [puppet] - https://gerrit.wikimedia.org/r/1227718 (https://phabricator.wikimedia.org/T414347) [09:38:08] (CR) Ayounsi: "Adding a few reviewers to hopefully unblock it until we increase transport capacity." [puppet] - https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617) (owner: Cathal Mooney) [09:40:02] (PS15) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [09:40:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:42:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87615 and previous config saved to /var/cache/conftool/dbconfig/20260116-094240-marostegui.json [09:44:47] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11527995 (FCeratto-WMF) [09:48:18] SRE, SRE-Access-Requests: Requesting 
access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11528001 (FCeratto-WMF) [09:52:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87616 and previous config saved to /var/cache/conftool/dbconfig/20260116-095248-marostegui.json [09:53:19] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11528013 (FCeratto-WMF) Pending out of band SSH verification. [09:54:25] SRE, SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11528020 (FCeratto-WMF) [09:54:28] (PS1) Gehel: wdqs: setup new test servers for Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1227726 (https://phabricator.wikimedia.org/T412235) [09:55:00] SRE, SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11528025 (FCeratto-WMF) [09:56:08] (PS2) Dzahn: zookeeper: add ssl.keyStore.passwordPath [puppet] - https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) [09:56:18] (CR) Btullis: [C:+1] wdqs: setup new test servers for Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1227726 (https://phabricator.wikimedia.org/T412235) (owner: Gehel) [09:56:47] !log remove static routes for magru ranges on cr1-eqiad to revert load-balance of transport traffic T414473 (https://phabricator.wikimedia.org/P87617) [09:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:53] T414473: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473 [09:57:21] (PS3) Dzahn: zookeeper: add ssl.keyStore.passwordPath [puppet] - 
https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) [09:57:40] (PS4) Dzahn: zookeeper: add ssl.keyStore.passwordPath [puppet] - https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) [10:02:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87618 and previous config saved to /var/cache/conftool/dbconfig/20260116-100257-marostegui.json [10:03:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:03:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2245.codfw.wmnet with reason: Maintenance [10:03:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T413525)', diff saved to https://phabricator.wikimedia.org/P87619 and previous config saved to /var/cache/conftool/dbconfig/20260116-100322-marostegui.json [10:04:40] (PS1) Gehel: wdqs: cleanup site.pp entries for WDQS to make it more readable [puppet] - https://gerrit.wikimedia.org/r/1227728 [10:04:51] (CR) Dzahn: [C:-1] "actually.. we want to set the path to a file containing the password, not the password itself and also not mix zuul and zookeper lookups ." 
[puppet] - https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn) [10:07:15] (PS2) Gehel: wdqs: cleanup site.pp entries for WDQS to make it more readable [puppet] - https://gerrit.wikimedia.org/r/1227728 [10:07:50] (CR) Gehel: [C:+2] wdqs: setup new test servers for Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1227726 (https://phabricator.wikimedia.org/T412235) (owner: Gehel) [10:12:14] (PS1) Brouberol: Define the airflow-sre public and internal domains [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) [10:12:56] (PS1) Brouberol: Define the airflow-sre kubeconfig files [puppet] - https://gerrit.wikimedia.org/r/1227732 (https://phabricator.wikimedia.org/T402512) [10:12:59] (PS1) Brouberol: Setup the caching and ATS rules to publicly expose airflow-sre.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) [10:14:05] (PS1) Muehlenhoff: pcc_update_facts: Rename variables [puppet] - https://gerrit.wikimedia.org/r/1227734 (https://phabricator.wikimedia.org/T365798) [10:14:43] (CR) Elukey: [C:+1] "two nits but LGTM" [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:15:27] (CR) Elukey: [C:+1] Define the airflow-sre kubeconfig files [puppet] - https://gerrit.wikimedia.org/r/1227732 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:15:50] (CR) Elukey: [C:+1] Setup the caching and ATS rules to publicly expose airflow-sre.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:19:50] (PS1) Dzahn: zuul: write TLS passphrase to a file for zookeeper [puppet] - https://gerrit.wikimedia.org/r/1227735 (https://phabricator.wikimedia.org/T405119) [10:20:24] (CR) Vgutierrez: "please do not 
merge till airflow-sre.discovery.wmnet is available" [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:25:09] (PS1) Bartosz Wójtowicz: ml-services: Lower resource usage for article-descriptions on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1227736 (https://phabricator.wikimedia.org/T414431) [10:25:23] (CR) Elukey: [C:+1] "Also if possible let's rework the commit msg to something more meaningful. Maybe something like "trafficserver: setup caching and etc.."." [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:26:08] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1227735/7903/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1227735 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn) [10:34:55] SRE, SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11528171 (FCeratto-WMF) [10:35:04] (CR) Clément Goubert: wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) (owner: Jasmine) [10:35:12] (CR) Clément Goubert: wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) (owner: Jasmine) [10:35:17] (PS2) Brouberol: trafficserver: setup caching and ATS rules to publicly expose airflow-sre.w.o [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) [10:35:28] (CR) Brouberol: "Yep, as usual!" 
[puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:35:44] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11528172 (FCeratto-WMF) [10:36:45] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619#11528184 (MoritzMuehlenhoff) p:Triage→Medium [10:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:39:48] !log installing Linux 6.12.63 on trixie hosts [10:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:58] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11528193 (BTullis) I'm slightly confused by this, now. Has the drive swap already been done, @VRiley-WMF ? I'm checking the output from `sudo perccli64... 
[10:42:22] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11528194 (BTullis) [10:42:49] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1227718 (https://phabricator.wikimedia.org/T414347) (owner: Dzahn) [10:45:02] (CR) Btullis: [C:+1] wdqs: cleanup site.pp entries for WDQS to make it more readable [puppet] - https://gerrit.wikimedia.org/r/1227728 (owner: Gehel) [10:48:23] (CR) Brouberol: [C:+2] Define the airflow-sre kubeconfig files [puppet] - https://gerrit.wikimedia.org/r/1227732 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:55:00] (PS2) Trueg: blazegraph: alert on ratio of failed queries increase [alerts] - https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) [10:59:17] (CR) Trueg: "I lowered the threshold to `0.1` which might still be too high (especially considering that I changed the metric from `30m` to `5m` which " [alerts] - https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: Trueg) [11:06:37] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:08:47] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:10:51] (CR) Dzahn: [C:+2] admin: add Aisha Khatun to deployers [puppet] - https://gerrit.wikimedia.org/r/1227718 (https://phabricator.wikimedia.org/T414347) (owner: Dzahn) [11:13:08] (CR) Kosta Harlan: IPReputation: Define data provider, URL and developer mode config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1223635 (https://phabricator.wikimedia.org/T410615) (owner: Kosta Harlan) [11:14:14] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528241 (Dzahn) Hi @AKhatun_WMF you have been added to the 
deployment group. Welcome to deployers! Access to deployment servers should work within ~... [11:15:07] (PS1) Dpogorzelski: ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 [11:15:18] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528243 (Dzahn) [11:15:46] (PS4) Kosta Harlan: (WIP) IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) [11:15:53] (PS4) Kosta Harlan: IPReputation: Define data provider, URL and developer mode config [mediawiki-config] - https://gerrit.wikimedia.org/r/1223635 (https://phabricator.wikimedia.org/T410615) [11:15:53] (PS5) Kosta Harlan: (WIP) IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) [11:16:30] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528244 (Dzahn) Closing this as resolved. For logstash access please see the comment from Tyler above. That's self-service via idm.wikimedia.org.... [11:16:39] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11528245 (BTullis) In progress→Resolved This is back up and running with 12 data drives. ` btullis@an-worker1148... 
[11:19:45] (CR) Elukey: ml-build: add missing configs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:20:09] (CR) Btullis: [C:+1] Define the airflow-sre public and internal domains [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [11:20:32] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528266 (Dzahn) Open→Resolved a:Dzahn @AKhatun_WMF As the next step please also request the "spiderpig-access" group via IDM. See h... [11:20:39] (CR) Btullis: [C:+1] trafficserver: setup caching and ATS rules to publicly expose airflow-sre.w.o [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [11:21:42] (PS2) Dpogorzelski: ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 [11:22:49] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11528272 (FCeratto-WMF) [Pinged RKemper on IRC] [11:23:32] (PS3) Dpogorzelski: ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 [11:23:40] (CR) Dpogorzelski: ml-build: add missing configs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:27:04] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:27:49] (PS1) Muehlenhoff: Remove profile::admin::groups from old mediawiki roles [puppet] - https://gerrit.wikimedia.org/r/1227744 [11:27:49] (PS1) Muehlenhoff: Remove mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/1227745 [11:28:26] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:29:16] ops-eqiad, SRE, DC-Ops: 
Power Supply - PS1 Status - issue on clouddb1024:9290 - https://phabricator.wikimedia.org/T414681#11528290 (Jclark-ctr) Open→Resolved a:Jclark-ctr Replaced power cable [11:30:05] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11528293 (Jclark-ctr) @cmooney i have disconnected all the switches [11:30:37] (CR) Elukey: "Dawid for some reason the 'auto' selector seems to lead to `WARNING: no nodes found for class: Class/Profile::Docker::Ml_builder` and then" [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:34:15] (PS1) Kevin Bazira: ml-services: bump revertrisk CPU limit (ResourceQuota) for RR namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1227746 (https://phabricator.wikimedia.org/T414060)