[00:01:50] ops-codfw, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11445928 (Jhancock.wm)
[00:09:54] (CR) Jdlrobson: [C:-1] "-1 is for a small note documenting the group which would be helpful" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) (owner: LorenMora)
[00:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:34:03] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:40:08] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1216875
[00:40:08] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1216875 (owner: TrainBranchBot)
[00:44:10] ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11446008 (RKemper) >>! In T411919#11442705, @Jclark-ctr wrote: > If I can do tomorrow between 3pm -6pm est time @rkemper. Does same time on weds work?
[00:53:23] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1216875 (owner: TrainBranchBot)
[01:01:00] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:01] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1216877
[01:10:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1216877 (owner: TrainBranchBot)
[01:18:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 17m 50s)
[01:36:10] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1216877 (owner: TrainBranchBot)
[01:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:46:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:48:39] (PS1) DLynch: mobileSectionSwitch: action_context needs to be stringified [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1216883 (https://phabricator.wikimedia.org/T410803)
[04:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:12:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1216613 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[04:12:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1216883 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[04:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:34:03] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:46:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:06:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T410589)', diff saved to https://phabricator.wikimedia.org/P86488 and previous config saved to /var/cache/conftool/dbconfig/20251210-050603-ladsgroup.json
[05:06:07] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[05:10:02] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:21:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86489 and previous config saved to /var/cache/conftool/dbconfig/20251210-052110-ladsgroup.json
[05:35:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:36:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86490 and previous config saved to /var/cache/conftool/dbconfig/20251210-053618-ladsgroup.json
[05:51:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T410589)', diff saved to https://phabricator.wikimedia.org/P86491 and previous config saved to /var/cache/conftool/dbconfig/20251210-055125-ladsgroup.json
[05:51:29] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[05:51:31] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2223.codfw.wmnet with reason: Maintenance
[05:51:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T410589)', diff saved to https://phabricator.wikimedia.org/P86492 and previous config saved to /var/cache/conftool/dbconfig/20251210-055138-ladsgroup.json
[05:56:03] !log dpogorzelski@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS trixie
[05:58:25] (PS6) Ryan Kemper: wdqs: add availability sli recording rules [puppet] - https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966)
[05:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:02:46] (Abandoned) Ryan Kemper: wdqs: detect blazegraph deadlock [alerts] - https://gerrit.wikimedia.org/r/1198161 (https://phabricator.wikimedia.org/T389859) (owner: Ryan Kemper)
[06:03:33] (CR) Ryan Kemper: [C:+2] "https://gerrit.wikimedia.org/r/c/operations/alerts/+/1216825" [alerts] - https://gerrit.wikimedia.org/r/1212170 (https://phabricator.wikimedia.org/T389859) (owner: Gehel)
[06:06:48] (PS2) Ryan Kemper: sre.data engineering cookbooks: use get_subset [cookbooks] - https://gerrit.wikimedia.org/r/976163 (owner: Volans)
[06:11:29] (CR) CI reject: [V:-1] sre.data engineering cookbooks: use get_subset [cookbooks] - https://gerrit.wikimedia.org/r/976163 (owner: Volans)
[06:14:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:15:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:17:08] (PS3) Ryan Kemper: sre.data engineering cookbooks: use get_subset [cookbooks] - https://gerrit.wikimedia.org/r/976163 (owner: Volans)
[06:20:34] (PS4) Ryan Kemper: sre.data engineering cookbooks: use get_subset [cookbooks] - https://gerrit.wikimedia.org/r/976163 (owner: Volans)
[06:20:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:25:28] (PS5) Ryan Kemper: sre.data engineering cookbooks: use get_subset [cookbooks] - https://gerrit.wikimedia.org/r/976163 (owner: Volans)
[06:36:00] (CR) Ryan Kemper: "Okay, did my best to fix the merge conflicts in sre/wdqs/data-transfer.py. Small comment inline." [cookbooks] - https://gerrit.wikimedia.org/r/976163 (owner: Volans)
[06:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T0700)
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:08:32] (PS7) Elukey: wdqs: add availability sli recording rules [puppet] - https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[08:09:29] (CR) Elukey: [C:+1] "Seems to work fine, just tested it in Thanos!" [puppet] - https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[08:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:14:46] (CR) Slyngshede: [C:+1] "Key verified out of band." [puppet] - https://gerrit.wikimedia.org/r/1216823 (https://phabricator.wikimedia.org/T412126) (owner: Ryan Kemper)
[08:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:22:59] (PS1) Elukey: team-sre: avoid cert-expiry alerts for staging endpoints [alerts] - https://gerrit.wikimedia.org/r/1217107
[08:25:46] (PS1) Brouberol: global_config: expose the hiveserver2 port for hive services [puppet] - https://gerrit.wikimedia.org/r/1217108 (https://phabricator.wikimedia.org/T408819)
[08:26:43] (CR) Brouberol: [C:+1] wdqs: correct deploy tag and add codfw as site [alerts] - https://gerrit.wikimedia.org/r/1216825 (https://phabricator.wikimedia.org/T389859) (owner: Ryan Kemper)
[08:28:48] (PS3) Gehel: query_service: relax alerting on WDQS lag [alerts] - https://gerrit.wikimedia.org/r/1215201 (https://phabricator.wikimedia.org/T411772)
[08:29:38] (PS1) Slyngshede: data.yaml extension for trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1217110
[08:30:42] (CR) Slyngshede: [C:+2] data.yaml extension for trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1217110 (owner: Slyngshede)
[08:30:52] ops-eqiad, DC-Ops: eno1 on db1182:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T412183 (phaultfinder) NEW
[08:33:44] (PS1) Slyngshede: data.yaml contract extension for atitkov [puppet] - https://gerrit.wikimedia.org/r/1217112
[08:33:59] (CR) Brouberol: [C:+1] query_service: relax alerting on WDQS lag [alerts] - https://gerrit.wikimedia.org/r/1215201 (https://phabricator.wikimedia.org/T411772) (owner: Gehel)
[08:34:03] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:36:23] (CR) Slyngshede: [C:+2] data.yaml contract extension for atitkov [puppet] - https://gerrit.wikimedia.org/r/1217112 (owner: Slyngshede)
[08:39:06] (PS4) Gehel: query_service: relax alerting on WDQS lag [alerts] - https://gerrit.wikimedia.org/r/1215201 (https://phabricator.wikimedia.org/T411772)
[08:39:11] (PS1) Slyngshede: data.yaml extension for dani [puppet] - https://gerrit.wikimedia.org/r/1217114
[08:40:27] (CR) Gehel: [C:+2] query_service: relax alerting on WDQS lag [alerts] - https://gerrit.wikimedia.org/r/1215201 (https://phabricator.wikimedia.org/T411772) (owner: Gehel)
[08:40:59] (CR) Slyngshede: [C:+2] data.yaml extension for dani [puppet] - https://gerrit.wikimedia.org/r/1217114 (owner: Slyngshede)
[08:46:36] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11446302 (jcrespo)
[08:48:34] (PS1) Slyngshede: data.yaml offboarding Joanna [puppet] - https://gerrit.wikimedia.org/r/1217117
[08:49:18] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11446304 (jcrespo)
[08:52:15] (PS2) Cathal Mooney: Nokia: add support for SR-Linux v25 or v24 [homer/public] - https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157)
[08:53:57] (CR) Jcrespo: [C:+1] "Looks fine to me, but giving a chance to Moritz for review." [puppet] - https://gerrit.wikimedia.org/r/1216823 (https://phabricator.wikimedia.org/T412126) (owner: Ryan Kemper)
[08:57:11] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11446307 (jcrespo) p:Triage→High
[08:59:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:00:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:10:20] !log joal@deploy2002 Started deploy [analytics/refinery@6e8f9d4] (hadoop-test): Regular analytics train TEST [analytics/refinery@6e8f9d4a]
[09:10:24] (PS1) Dpogorzelski: Merge branch 'production' into dev [puppet] - https://gerrit.wikimedia.org/r/1217121
[09:10:24] (PS1) Dpogorzelski: Merge branch 'production' into dev [puppet] - https://gerrit.wikimedia.org/r/1217122
[09:10:24] (PS1) Dpogorzelski: ml-build: prep hosts as simple GPU nodes [puppet] - https://gerrit.wikimedia.org/r/1217123
[09:11:18] (CR) Slyngshede: [C:+1] "I'm playing Moritz for the week :-)" [puppet] - https://gerrit.wikimedia.org/r/1216823 (https://phabricator.wikimedia.org/T412126) (owner: Ryan Kemper)
[09:11:25] !log joal@deploy2002 Finished deploy [analytics/refinery@6e8f9d4] (hadoop-test): Regular analytics train TEST [analytics/refinery@6e8f9d4a] (duration: 01m 04s)
[09:11:45] !log joal@deploy2002 Started deploy [analytics/refinery@6e8f9d4]: Regular analytics train [analytics/refinery@6e8f9d4a]
[09:12:15] (Abandoned) Dpogorzelski: ml-build: prep hosts as simple GPU nodes [puppet] - https://gerrit.wikimedia.org/r/1217123 (owner: Dpogorzelski)
[09:12:15] (Abandoned) Dpogorzelski: Merge branch 'production' into dev [puppet] - https://gerrit.wikimedia.org/r/1217122 (owner: Dpogorzelski)
[09:12:15] (Abandoned) Dpogorzelski: Merge branch 'production' into dev [puppet] - https://gerrit.wikimedia.org/r/1217121 (owner: Dpogorzelski)
[09:12:19] (CR) Muehlenhoff: "Indeed, no need to wait for me at all." [puppet] - https://gerrit.wikimedia.org/r/1216823 (https://phabricator.wikimedia.org/T412126) (owner: Ryan Kemper)
[09:12:44] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1217117 (owner: Slyngshede)
[09:13:11] (PS1) Dpogorzelski: ml-build: prep hosts as simple GPU nodes [puppet] - https://gerrit.wikimedia.org/r/1217125
[09:14:05] (PS1) Samtar: Set wgEnableWatchlistLabels for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1217124 (https://phabricator.wikimedia.org/T411836)
[09:14:15] !log joal@deploy2002 Finished deploy [analytics/refinery@6e8f9d4]: Regular analytics train [analytics/refinery@6e8f9d4a] (duration: 02m 30s)
[09:14:38] !log joal@deploy2002 Started deploy [analytics/refinery@6e8f9d4] (thin): Regular analytics train THIN [analytics/refinery@6e8f9d4a]
[09:14:58] (CR) Elukey: [C:+1] ml-build: prep hosts as simple GPU nodes [puppet] - https://gerrit.wikimedia.org/r/1217125 (owner: Dpogorzelski)
[09:15:25] (CR) Dpogorzelski: [C:+2] ml-build: prep hosts as simple GPU nodes [puppet] - https://gerrit.wikimedia.org/r/1217125 (owner: Dpogorzelski)
[09:15:50] (CR) Slyngshede: [C:+2] data.yaml offboarding Joanna [puppet] - https://gerrit.wikimedia.org/r/1217117 (owner: Slyngshede)
[09:15:52] !log joal@deploy2002 Finished deploy [analytics/refinery@6e8f9d4] (thin): Regular analytics train THIN [analytics/refinery@6e8f9d4a] (duration: 01m 13s)
[09:16:00] slyngs: :(
[09:16:33] elukey: Very much :(
[09:20:48] (PS4) Klausman: aptrepo: Expand ROCm 6.4 packagelist to full set [puppet] - https://gerrit.wikimedia.org/r/1216826
[09:21:10] (CR) Jcrespo: [C:+2] ryankemper: fido-based ssh access [puppet] - https://gerrit.wikimedia.org/r/1216823 (https://phabricator.wikimedia.org/T412126) (owner: Ryan Kemper)
[09:21:17] (Abandoned) Klausman: aptrepo: Expand ROCm 6.4 packagelist to full set [puppet] - https://gerrit.wikimedia.org/r/1216826 (owner: Klausman)
[09:27:47] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11446366 (jcrespo) I've deployed it to bast1003, can you test?
[09:28:22] (PS1) Dpogorzelski: ml-build: do not install rocm [puppet] - https://gerrit.wikimedia.org/r/1217126
[09:28:34] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11446367 (jcrespo)
[09:28:48] (CR) Dpogorzelski: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1217126 (owner: Dpogorzelski)
[09:29:28] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11446382 (jcrespo) a:jcrespo
[09:30:26] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11446386 (jcrespo) @WMDE-leszek May I ask for approval?
[09:33:51] (CR) Btullis: [C:+1] global_config: expose the hiveserver2 port for hive services [puppet] - https://gerrit.wikimedia.org/r/1217108 (https://phabricator.wikimedia.org/T408819) (owner: Brouberol)
[09:34:15] (CR) Brouberol: [C:+2] global_config: expose the hiveserver2 port for hive services [puppet] - https://gerrit.wikimedia.org/r/1217108 (https://phabricator.wikimedia.org/T408819) (owner: Brouberol)
[09:43:50] (CR) Dpogorzelski: [C:+2] ml-build: do not install rocm [puppet] - https://gerrit.wikimedia.org/r/1217126 (owner: Dpogorzelski)
[09:44:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:45:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:47:17] !log jelto@puppetserver1001 conftool action : set/pooled=no; selector: name=tcp-proxy6001.drmrs.wmnet
[09:49:42] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:50:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:50:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:55:37] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11446434 (jcrespo) @Solenne_Lazare_WMDE You have been added to the NDA and WMDE LDAP groups, which means you should have...
[09:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:01:39] !log jelto@puppetserver1001 conftool action : set/pooled=no; selector: cluster=tcp-proxy,service=gerrit,dc=drmrs
[10:04:25] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.rename from ml-lab1001 to ml-build1001
[10:04:48] !log dpogorzelski@cumin1003 START - Cookbook sre.dns.netbox
[10:10:46] !log dpogorzelski@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming ml-lab1001 to ml-build1001 - dpogorzelski@cumin1003"
[10:11:30] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming ml-lab1001 to ml-build1001 - dpogorzelski@cumin1003"
[10:11:30] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:11:30] !log dpogorzelski@cumin1003 START - Cookbook sre.dns.wipe-cache ml-build1001 on all recursors
[10:11:33] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-build1001 on all recursors
[10:11:34] !log dpogorzelski@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ml-build1001
[10:11:35] !log jelto@puppetserver1001 conftool action : set/pooled=no; selector: cluster=tcp-proxy,service=gerrit
[10:12:15] (CR) Arnaudb: [C:+2] gerrit: add a confirmation prompt on rsync [cookbooks] - https://gerrit.wikimedia.org/r/1216592 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[10:13:32] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-build1001
[10:14:10] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from ml-lab1001 to ml-build1001
[10:14:42] RESOLVED: [12x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:17:04] (Merged) jenkins-bot: gerrit: add a confirmation prompt on rsync [cookbooks] - https://gerrit.wikimedia.org/r/1216592 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[10:19:30] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.reimage for host ml-build1001.eqiad.wmnet with OS trixie
[10:34:25] (PS1) Jcrespo: admin: Add production access to Solenne_Lazare_WMDE [puppet] - https://gerrit.wikimedia.org/r/1217135 (https://phabricator.wikimedia.org/T411977)
[10:35:32] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-build1001.eqiad.wmnet with reason: host reimage
[10:37:17] (PS1) Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1217133 (https://phabricator.wikimedia.org/T338470)
[10:39:12] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-build1001.eqiad.wmnet with reason: host reimage
[10:44:13] (PS2) Jcrespo: admin: Add production access to Solenne_Lazare_WMDE [puppet] - https://gerrit.wikimedia.org/r/1217135 (https://phabricator.wikimedia.org/T411977)
[10:51:04] (CR) Ayounsi: [C:+1] "overall lgtm, one nit inline" [homer/public] - https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) (owner: Cathal Mooney)
[10:52:48] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jcrespo - https://phabricator.wikimedia.org/T412192 (jcrespo) NEW
[10:53:35] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Nokia: how to approach schema differences in SR-Linux versions - https://phabricator.wikimedia.org/T412157#11446599 (ayounsi) I think the ideal would be to store all the OS versions (Debian, Juniper, Nokia) in Netbox, to for example not h...
[10:54:21] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jcrespo - https://phabricator.wikimedia.org/T412192#11446600 (jcrespo)
[10:54:21] (PS2) Ayounsi: Turnilo: annotate well known JA3N [puppet] - https://gerrit.wikimedia.org/r/1216753
[10:54:42] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jcrespo - https://phabricator.wikimedia.org/T412192#11446601 (jcrespo) @KOfori could you approve my request?
[10:54:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:55:18] (CR) Ayounsi: Turnilo: annotate well known JA3N (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1216753 (owner: Ayounsi)
[10:55:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:56:10] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jcrespo - https://phabricator.wikimedia.org/T412192#11446602 (jcrespo)
[10:59:20] (PS1) Aqu: Airflow analytics-test: Add pg_stat_statements extension [deployment-charts] - https://gerrit.wikimedia.org/r/1217138
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1100)
[11:00:14] (CR) Jelto: "see my comment in I4c8ec2c5f6bbca511a41deb0bcdb8407c1ff792d" [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[11:03:13] (CR) Btullis: [C:+2] Airflow analytics-test: Add pg_stat_statements extension [deployment-charts] - https://gerrit.wikimedia.org/r/1217138 (owner: Aqu)
[11:04:04] (CR) Kosta Harlan: Define config for v2 of suggested investigations instrument (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: Dreamy Jazz)
[11:05:02] (Merged) jenkins-bot: Airflow analytics-test: Add pg_stat_statements extension [deployment-charts] - https://gerrit.wikimedia.org/r/1217138 (owner: Aqu)
[11:06:42] (CR) Dreamy Jazz: Define config for v2 of suggested investigations instrument (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: Dreamy Jazz)
[11:15:31] (CR) Kosta Harlan: [C:+1] Define config for v2 of suggested investigations instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: Dreamy Jazz)
[11:21:24] (PS1) Btullis: Update the spark master parameter to use the correct option [deployment-charts] - https://gerrit.wikimedia.org/r/1217145 (https://phabricator.wikimedia.org/T406833)
[11:23:02] (PS2) Btullis: Update the spark master parameter to use the correct option [deployment-charts] - https://gerrit.wikimedia.org/r/1217145 (https://phabricator.wikimedia.org/T406833)
[11:27:23] (PS1) Btullis: Correct the values for postgresql parameters on analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1217149 (https://phabricator.wikimedia.org/T412003)
[11:31:06] jouncebot: nowandnext
[11:31:07] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1100)
[11:31:07] In 0 hour(s) and 28 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1200)
[11:35:25] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1017.eqiad.wmnet
[11:41:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet
[11:43:31] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1217135 (https://phabricator.wikimedia.org/T411977) (owner: Jcrespo)
[11:43:49] (CR) Btullis: [C:+2] Correct the values for postgresql parameters on analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1217149 (https://phabricator.wikimedia.org/T412003) (owner: Btullis)
[11:45:29] (Merged) jenkins-bot: Correct the values for postgresql parameters on analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1217149 (https://phabricator.wikimedia.org/T412003) (owner: Btullis)
[11:50:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-analytics-test: apply
[11:50:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-analytics-test: apply
[11:57:42] (PS1) Slyngshede: Keymanagement: Increase allowed size on key_type field [software/bitu] - https://gerrit.wikimedia.org/r/1217157 (https://phabricator.wikimedia.org/T411816)
[12:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1200).
[12:01:37] (CR) Muehlenhoff: [C:+1] "Looks good!" [software/bitu] - https://gerrit.wikimedia.org/r/1217157 (https://phabricator.wikimedia.org/T411816) (owner: Slyngshede)
[12:03:20] (CR) Slyngshede: [C:+2] Keymanagement: Increase allowed size on key_type field [software/bitu] - https://gerrit.wikimedia.org/r/1217157 (https://phabricator.wikimedia.org/T411816) (owner: Slyngshede)
[12:06:22] (Merged) jenkins-bot: Keymanagement: Increase allowed size on key_type field [software/bitu] - https://gerrit.wikimedia.org/r/1217157 (https://phabricator.wikimedia.org/T411816) (owner: Slyngshede)
[12:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:19:48] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11446962 (Jclark-ctr) a:colewhite→Jclark-ctr
[12:20:00] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11446963 (Jclark-ctr)
[12:22:14] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11446980 (Jclark-ctr)
[12:22:26] (CR) Michael Große: [C:+1] Enable HTML confirmation email on Wikidata and Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1211813 (https://phabricator.wikimedia.org/T410971) (owner: Urbanecm)
[12:25:24] ops-eqiad, SRE, DC-Ops: eno1 on db1182:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T412183#11447046 (Jclark-ctr) Open→Resolved a:Jclark-ctr Replaced Sfp-t
[12:26:15] (CR) Jcrespo: [C:+2] admin: Add production access to Solenne_Lazare_WMDE [puppet] - https://gerrit.wikimedia.org/r/1217135 (https://phabricator.wikimedia.org/T411977) (owner: Jcrespo)
[12:30:15] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11447084 (Jclark-ctr) a:Eevans→Jclark-ctr
[12:31:03] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11447086 (jcrespo) Open→Resolved @Solenne_Lazare_WMDE Access has been deployed, please give it 30 minutes to...
[12:32:48] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11447097 (jcrespo)
[12:33:30] (CR) Btullis: [C:+2] Update the spark master parameter to use the correct option [deployment-charts] - https://gerrit.wikimedia.org/r/1217145 (https://phabricator.wikimedia.org/T406833) (owner: Btullis)
[12:35:20] (Merged) jenkins-bot: Update the spark master parameter to use the correct option [deployment-charts] - https://gerrit.wikimedia.org/r/1217145 (https://phabricator.wikimedia.org/T406833) (owner: Btullis)
[12:35:47] (CR) Nvdtn19: [C:+1] Configuration for viwikivoyage per T405724 [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19)
[12:42:48] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:49:48] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jcrespo -
https://phabricator.wikimedia.org/T412192#11447163 (KOfori) This is approved. [12:50:44] (CR) Santiago Faci: Define config for v2 of suggested investigations instrument (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: Dreamy Jazz) [12:52:54] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:53:02] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [13:04:08] (CR) Arlolra: [C:+1] ExtensionDistributor: mark 1.45 as stable [mediawiki-config] - https://gerrit.wikimedia.org/r/1216674 (https://phabricator.wikimedia.org/T408482) (owner: MacFan4000) [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:05:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216674 (https://phabricator.wikimedia.org/T408482) (owner: MacFan4000) [13:06:39] here [13:06:46] looking to see if it's thumbor [13:07:29] Here as well [13:08:19] mostly 503s 504s [13:08:33] it's been heating up since 1200 [13:08:36] er 0700 [13:08:50] thumbor has *not* been erroring in lockstep 
[13:09:50] I don't see a clear rise in traffic that matches [13:09:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:10:00] it alerting in esams is fishy also [13:10:28] slight increase in eqiad also though which tracks [13:12:24] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jcrespo - https://phabricator.wikimedia.org/T412192#11447225 (jcrespo) [13:16:34] (PS1) Jcrespo: admin: Add jynus to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1217177 (https://phabricator.wikimedia.org/T412192) [13:18:18] (CR) Jcrespo: "I would appreciate a +1 (assuming you are ok with it)" [puppet] - https://gerrit.wikimedia.org/r/1217177 (https://phabricator.wikimedia.org/T412192) (owner: Jcrespo) [13:20:16] !log hnowlan@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [13:25:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T410589)', diff saved to https://phabricator.wikimedia.org/P86496 and previous config saved to /var/cache/conftool/dbconfig/20251210-132459-ladsgroup.json [13:25:04] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:27:01] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [13:27:22] (CR) Slyngshede: [C:+1] "Perfectly OK." 
[puppet] - 10https://gerrit.wikimedia.org/r/1217177 (https://phabricator.wikimedia.org/T412192) (owner: 10Jcrespo) [13:31:08] (03PS1) 10Sbisson: CX3 Build 1.0.0+20251209 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217181 (https://phabricator.wikimedia.org/T384485) [13:32:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217181 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson) [13:33:06] (03PS2) 10Jcrespo: admin: Add jynus to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1217177 (https://phabricator.wikimedia.org/T412192) [13:33:58] (03PS3) 10Jcrespo: admin: Add jynus to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1217177 (https://phabricator.wikimedia.org/T412192) [13:34:34] (03PS1) 10KartikMistry: Update Recommendation API to 2025-12-09-164214-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217182 (https://phabricator.wikimedia.org/T384485) [13:35:06] (03CR) 10Jcrespo: [C:03+2] admin: Add jynus to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1217177 (https://phabricator.wikimedia.org/T412192) (owner: 10Jcrespo) [13:36:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Jcrespo - https://phabricator.wikimedia.org/T412192#11447271 (10jcrespo) 05Open→03Resolved p:05Triage→03Medium a:03jcrespo [13:37:56] (03CR) 10Sbisson: [C:03+2] Update Recommendation API to 2025-12-09-164214-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217182 (https://phabricator.wikimedia.org/T384485) (owner: 10KartikMistry) [13:38:44] (03PS1) 10Btullis: Add two basic spark pod templates in a configmap [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1217183 (https://phabricator.wikimedia.org/T406833) [13:39:40] (03Merged) 10jenkins-bot: Update Recommendation API to 2025-12-09-164214-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217182 (https://phabricator.wikimedia.org/T384485) (owner: 10KartikMistry) [13:40:06] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM and seems to match I9b9fe4c0dd." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216674 (https://phabricator.wikimedia.org/T408482) (owner: 10MacFan4000) [13:40:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P86497 and previous config saved to /var/cache/conftool/dbconfig/20251210-134007-ladsgroup.json [13:40:13] (03CR) 10CI reject: [V:04-1] Add two basic spark pod templates in a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217183 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [13:41:18] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:43:28] (03PS1) 10Dpogorzelski: ml-build: add hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1217184 [13:45:30] (03PS2) 10Btullis: Add two basic spark pod templates in a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217183 (https://phabricator.wikimedia.org/T406833) [13:46:50] (03CR) 10Elukey: "I'd personally add also" [puppet] - 10https://gerrit.wikimedia.org/r/1217184 (owner: 10Dpogorzelski) [13:46:58] (03CR) 10CI reject: [V:04-1] Add two basic spark pod templates in a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217183 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [13:47:04] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:50:02] (03PS2) 10Dpogorzelski: ml-build: add hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1217184 [13:51:58] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:52:03] (03CR) 10Dpogorzelski: "added the first 2 but hosts/ml-build1001.yaml already has:" [puppet] - 10https://gerrit.wikimedia.org/r/1217184 (owner: 10Dpogorzelski) [13:53:38] !log Updated Recommendation API to 2025-12-09-164214-production (T384485, T409338, T409332) [13:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:45] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [13:53:45] T409338: Include nominated collection suggestions for topic filter - https://phabricator.wikimedia.org/T409338 [13:53:46] T409332: Over represent nominated collection in 'all collections' suggestions - https://phabricator.wikimedia.org/T409332 [13:55:02] (03PS3) 10Dpogorzelski: ml-build: add hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1217184 [13:55:08] (03PS2) 10Dreamy Jazz: Define config for v2 of suggested investigations instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) [13:55:11] (03CR) 10Dreamy Jazz: Define config for v2 of suggested investigations instrument (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: 10Dreamy Jazz) [13:55:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P86499 and previous config saved to /var/cache/conftool/dbconfig/20251210-135514-ladsgroup.json [13:59:36] (03CR) 10Elukey: "yeah you can remove those now as well, and copy them to the role's hiera specific." 
[puppet] - https://gerrit.wikimedia.org/r/1217184 (owner: Dpogorzelski) [13:59:37] (PS3) Btullis: Add two basic spark pod templates in a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1217183 (https://phabricator.wikimedia.org/T406833) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1400). [14:00:05] arlolra and stephanebisson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] (PS1) Blake: service: add exclude_from_switchover field. [puppet] - https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) [14:00:11] o/ [14:00:15] o/ [14:00:17] (CR) Blake: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) (owner: Blake) [14:01:34] arlolra doesn't seem to be here. Should I start? 
(no i18n today) [14:01:57] yeah I think you can go ahead [14:02:10] hopefully the backport won’t take too long to merge ^^ [14:02:18] (CR) TrainBranchBot: [C:+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217181 (https://phabricator.wikimedia.org/T384485) (owner: Sbisson) [14:04:17] (Merged) jenkins-bot: CX3 Build 1.0.0+20251209 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217181 (https://phabricator.wikimedia.org/T384485) (owner: Sbisson) [14:04:28] oh good ^^ [14:04:56] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1217181|CX3 Build 1.0.0+20251209 (T384485 T408845 T409332 T409337 T409338 T411779)]] [14:05:08] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [14:05:09] T408845: Visual indicator that an article in a list is part of a nominated collection - https://phabricator.wikimedia.org/T408845 [14:05:10] T409332: Over represent nominated collection in 'all collections' suggestions - https://phabricator.wikimedia.org/T409332 [14:05:10] T409337: Include nominated collection suggestions for country filter - https://phabricator.wikimedia.org/T409337 [14:05:10] T409338: Include nominated collection suggestions for topic filter - https://phabricator.wikimedia.org/T409338 [14:05:11] T411779: Handle invalid featured collection name - https://phabricator.wikimedia.org/T411779 [14:06:32] (PS2) Tiziano Fogli: icinga/external_monitoring: disable http-unauthorized check [puppet] - https://gerrit.wikimedia.org/r/1217193 (https://phabricator.wikimedia.org/T393625) [14:06:32] (CR) Tiziano Fogli: "The vhost is now accessible from $network::constants::domain_networks, and the availability of the endpoint is checked by HetrixTools." 
[puppet] - 10https://gerrit.wikimedia.org/r/1217193 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [14:07:02] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1217181|CX3 Build 1.0.0+20251209 (T384485 T408845 T409332 T409337 T409338 T411779)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:07:07] (03CR) 10Hnowlan: [C:03+1] "lgtm, might wait for scott to +1 just to see if it matches what he had in mind" [puppet] - 10https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [14:08:18] (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: 10Dreamy Jazz) [14:08:43] !log sbisson@deploy2002 sbisson: Continuing with sync [14:08:44] (03PS2) 10Blake: service: add exclude_from_switchover field. [puppet] - 10https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) [14:09:44] (03PS3) 10Blake: service: add exclude_from_switchover field. [puppet] - 10https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) [14:10:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T410589)', diff saved to https://phabricator.wikimedia.org/P86500 and previous config saved to /var/cache/conftool/dbconfig/20251210-141022-ladsgroup.json [14:10:26] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:10:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2228.codfw.wmnet with reason: Maintenance [14:10:41] (03CR) 10Blake: "Gotcha, will do." 
[puppet] - 10https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [14:10:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T410589)', diff saved to https://phabricator.wikimedia.org/P86501 and previous config saved to /var/cache/conftool/dbconfig/20251210-141046-ladsgroup.json [14:11:06] (03CR) 10Ssingh: ats: gerrit: don't validate TLS host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [14:13:57] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217181|CX3 Build 1.0.0+20251209 (T384485 T408845 T409332 T409337 T409338 T411779)]] (duration: 09m 01s) [14:14:07] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [14:14:07] T408845: Visual indicator that an article in a list is part of a nominated collection - https://phabricator.wikimedia.org/T408845 [14:14:07] T409332: Over represent nominated collection in 'all collections' suggestions - https://phabricator.wikimedia.org/T409332 [14:14:09] T409337: Include nominated collection suggestions for country filter - https://phabricator.wikimedia.org/T409337 [14:14:09] T409338: Include nominated collection suggestions for topic filter - https://phabricator.wikimedia.org/T409338 [14:14:10] T411779: Handle invalid featured collection name - https://phabricator.wikimedia.org/T411779 [14:17:20] "(duration: 09m 01s)" - impressive [14:17:52] stephanebisson: let me know when you're done [14:18:03] arlolra I'm done [14:18:08] thanks [14:18:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216674 (https://phabricator.wikimedia.org/T408482) (owner: 10MacFan4000) [14:20:01] (03Merged) 10jenkins-bot: ExtensionDistributor: mark 1.45 as stable [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1216674 (https://phabricator.wikimedia.org/T408482) (owner: 10MacFan4000) [14:20:21] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1216674|ExtensionDistributor: mark 1.45 as stable (T408482)]] [14:20:25] T408482: Mark REL1_45 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T408482 [14:21:39] (03CR) 10Kosta Harlan: [C:03+1] Define config for v2 of suggested investigations instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: 10Dreamy Jazz) [14:22:16] !log arlolra@deploy2002 arlolra, macfan4000: Backport for [[gerrit:1216674|ExtensionDistributor: mark 1.45 as stable (T408482)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:22:51] !log arlolra@deploy2002 arlolra, macfan4000: Continuing with sync [14:24:27] (03PS2) 10Samtar: Set wgEnableWatchlistLabels for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217124 (https://phabricator.wikimedia.org/T411836) [14:26:34] Lucas_WMDE: still around? I have https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1217124 to deploy but currently don't have deployment access from this laptop - any chance you can? 
[14:26:40] sure [14:26:41] * Lucas_WMDE looks [14:26:50] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216674|ExtensionDistributor: mark 1.45 as stable (T408482)]] (duration: 06m 29s) [14:26:54] T408482: Mark REL1_45 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T408482 [14:28:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217124 (https://phabricator.wikimedia.org/T411836) (owner: Samtar) [14:28:38] oooooh sounds shiny [14:28:50] so basically you can have more than one watchlist? [14:29:03] "kinda" [14:29:05] “pages where I want to see changes every day”, “pages I might want to check once a fortnight” etc [14:29:14] arlolra: are you done deploying? :) [14:29:16] (CR) Elukey: [C:+1] icinga/external_monitoring: disable http-unauthorized check [puppet] - https://gerrit.wikimedia.org/r/1217193 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli) [14:30:12] (CR) Tiziano Fogli: [C:+2] icinga/external_monitoring: disable http-unauthorized check [puppet] - https://gerrit.wikimedia.org/r/1217193 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli) [14:34:25] arlolra: are you still deploying or can I take over? [14:36:27] All yours [14:36:31] thanks! [14:36:54] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Nokia: how to approach schema differences in SR-Linux versions - https://phabricator.wikimedia.org/T412157#11447522 (cmooney) >>! In T412157#11446599, @ayounsi wrote: > I think the ideal would be to store all the OS versions (Debian, Juni... 
[14:37:26] (03CR) 10Lucas Werkmeister (WMDE): Set wgEnableWatchlistLabels for beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217124 (https://phabricator.wikimedia.org/T411836) (owner: 10Samtar) [14:37:32] TheresNoTime: ^ optional comment there [14:37:42] (03PS8) 10Slyngshede: C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) [14:38:40] Lucas_WMDE: good point, will do, one moment [14:38:50] ok! [14:39:33] (03PS3) 10Samtar: Set wgEnableWatchlistLabels for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217124 (https://phabricator.wikimedia.org/T411836) [14:39:52] (03CR) 10CI reject: [V:04-1] C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [14:40:05] Lucas_WMDE: done [14:40:10] * Lucas_WMDE looks [14:40:22] thanks! [14:40:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217124 (https://phabricator.wikimedia.org/T411836) (owner: 10Samtar) [14:41:21] (03Merged) 10jenkins-bot: Set wgEnableWatchlistLabels for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217124 (https://phabricator.wikimedia.org/T411836) (owner: 10Samtar) [14:41:39] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1217124|Set wgEnableWatchlistLabels for beta (T411836)]] [14:41:43] T411836: Deploy to a Beta wiki - https://phabricator.wikimedia.org/T411836 [14:43:18] (03PS3) 10Cathal Mooney: Nokia: add support for SR-Linux v25 or v24 [homer/public] - 10https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) [14:43:38] (03CR) 10Cathal Mooney: Nokia: add support for SR-Linux v25 or v24 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1216869 
(https://phabricator.wikimedia.org/T412157) (owner: Cathal Mooney) [14:44:01] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, samtar: Backport for [[gerrit:1217124|Set wgEnableWatchlistLabels for beta (T411836)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:44:07] * TheresNoTime looking [14:44:36] well, not much to test except that the feature isn’t enabled in prod enwiki, I guess? ^^ [14:44:37] (CR) CI reject: [V:-1] Nokia: add support for SR-Linux v25 or v24 [homer/public] - https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) (owner: Cathal Mooney) [14:44:55] Lucas_WMDE: lgtm [14:45:02] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, samtar: Continuing with sync [14:45:04] ok, thanks! [14:45:48] (CR) Dillon: [C:+1] Enable revertrisk filters in thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1216804 (https://phabricator.wikimedia.org/T409438) (owner: Kgraessle) [14:48:11] (PS9) Slyngshede: C:mtail update trafficserver_backend_requests_seconds [puppet] - https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) [14:48:29] (CR) Slyngshede: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: Slyngshede) [14:49:00] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217124|Set wgEnableWatchlistLabels for beta (T411836)]] (duration: 07m 21s) [14:49:03] T411836: Deploy to a Beta wiki - https://phabricator.wikimedia.org/T411836 [14:49:04] (PS4) Cathal Mooney: Nokia: add support for SR-Linux v25 or v24 [homer/public] - https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) [14:49:20] Lucas_WMDE: thank you! [14:49:29] np! 
feel free to ping me if it needs to be reverted to unbreak beta ^^ [14:49:33] !log UTC afternoon backport+config window done [14:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:37] (optimistically ;)) [14:50:17] (03CR) 10CI reject: [V:04-1] C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [14:50:57] (03CR) 10Btullis: [C:03+2] Add two basic spark pod templates in a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217183 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [14:51:20] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:51:47] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:51:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:52:00] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:52:43] (03Merged) 10jenkins-bot: Add two basic spark pod templates in a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217183 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [14:52:44] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:52:53] (03PS5) 10Cathal Mooney: Nokia: add support for SR-Linux v25 or v24 [homer/public] - 10https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) [14:53:31] (03PS10) 10Slyngshede: C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) [14:54:07] (03PS6) 10Cathal Mooney: Nokia: add support for SR-Linux v25 or v24 [homer/public] - 
10https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) [14:54:39] jclark@cumin1003 provision (PID 3181277) is awaiting input [14:55:03] jclark@cumin1003 provision (PID 3181095) is awaiting input [14:55:07] jclark@cumin1003 provision (PID 3180870) is awaiting input [14:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:56:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [14:56:08] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [14:56:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt aqs servers - jclark@cumin1003" [14:56:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt aqs servers - jclark@cumin1003" [14:56:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:35] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-12-03-005631 to 2025-12-08-185405 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217214 (https://phabricator.wikimedia.org/T381137) [14:56:41] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-12-02-224740 to 2025-12-10-133418 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217215 (https://phabricator.wikimedia.org/T381137) [14:56:43] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:56:49] (03CR) 10Slyngshede: C:mtail update trafficserver_backend_requests_seconds (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [14:56:54] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:57:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:59:08] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1500) [15:00:56] jclark@cumin1003 provision (PID 3181277) is awaiting input [15:01:03] jclark@cumin1003 provision (PID 3181095) is awaiting input [15:03:29] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade evaluators from 2025-12-03-005631 to 2025-12-08-185405 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217214 (https://phabricator.wikimedia.org/T381137) (owner: 10Jforrester) [15:05:21] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-12-03-005631 to 2025-12-08-185405 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217214 (https://phabricator.wikimedia.org/T381137) (owner: 10Jforrester) [15:06:45] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:06:50] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:06:53] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:07:35] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:08:02] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:08:41] !log 
gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:08:47] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:09:19] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:09:29] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:09:31] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:09:42] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:09:54] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:10:02] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:07] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-12-02-224740 to 2025-12-10-133418 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217215 (https://phabricator.wikimedia.org/T381137) (owner: 10Jforrester) [15:10:16] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:10:26] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:10:33] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:10:45] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1027.mgmt.eqiad.wmnet 
with chassis set policy FORCE_RESTART [15:12:12] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-12-02-224740 to 2025-12-10-133418 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217215 (https://phabricator.wikimedia.org/T381137) (owner: 10Jforrester) [15:13:32] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:13:54] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:14:23] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:14:55] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:15:03] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:15:30] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:18:06] 10ops-codfw, 06DC-Ops, 10observability: Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229 (10RobH) 03NEW [15:18:24] 10ops-codfw, 06DC-Ops, 10observability: Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11447739 (10RobH) [15:18:52] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:18:57] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-12-08-185405 to 2025-12-10-150641 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217219 (https://phabricator.wikimedia.org/T406848) [15:19:17] 10ops-eqiad, 06DC-Ops, 10observability: Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230 (10RobH) 03NEW [15:19:38] 10ops-codfw, 06DC-Ops, 10observability: Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11447759 (10RobH) [15:20:16] 10ops-eqiad, 06DC-Ops, 10observability: Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11447763 (10RobH) [15:20:38] (03CR) 10Genoveva Galarza: [C:03+2] 
wikifunctions: Upgrade evaluators from 2025-12-08-185405 to 2025-12-10-150641 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217219 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [15:20:51] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:21:32] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host logging-sd1006 [15:21:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-sd1006 [15:21:48] (03CR) 10Ayounsi: [C:03+1] Nokia: add support for SR-Linux v25 or v24 [homer/public] - 10https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) (owner: 10Cathal Mooney) [15:22:17] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host logging-sd1007 [15:22:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-sd1007 [15:23:00] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-12-08-185405 to 2025-12-10-150641 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217219 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [15:23:19] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:23:29] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host logging-sd1005 [15:23:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-sd1005 [15:24:00] jouncebot: nowandnext [15:24:00] For the next 0 hour(s) and 35 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1500) [15:24:00] In 0 hour(s) and 5 minute(s): xLab Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1530) [15:24:00] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:24:21] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:24:39] Dreamy_Jazz: We're not touching MW-land. Not sure about the xLab team. [15:24:40] (03PS1) 10Elukey: sre.hosts.provision: add custom config for the new aqs Supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1217222 (https://phabricator.wikimedia.org/T407032) [15:24:56] Thanks. Just want to merge a config patch [15:25:03] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:25:11] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:25:35] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aqs1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:25:54] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:26:25] Going to proceed with it now [15:26:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: 10Dreamy Jazz) [15:27:29] (03Merged) 10jenkins-bot: Define config for v2 of suggested investigations instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216865 (https://phabricator.wikimedia.org/T409260) (owner: 10Dreamy Jazz) [15:27:48] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1216865|Define config for v2 of suggested investigations instrument (T409260)]] [15:27:51] T409260: Instrumentation from SI case -> Action taken - https://phabricator.wikimedia.org/T409260 [15:29:42] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1216865|Define config for v2 of suggested investigations instrument (T409260)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1500) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1530) [15:30:31] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: add custom config for the new aqs Supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1217222 (https://phabricator.wikimedia.org/T407032) (owner: 10Elukey) [15:30:36] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [15:33:00] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:34:34] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216865|Define config for v2 of suggested investigations instrument (T409260)]] (duration: 06m 47s) [15:34:38] T409260: Instrumentation from SI case -> Action taken - https://phabricator.wikimedia.org/T409260 [15:34:53] I'm done [15:35:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:31] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11447836 (10Dzahn) Great, thanks for forwarding the questions, ATitkov. Yea, it would be crucial to know if the currently existing redirect (... 
[15:40:02] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1002 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [15:41:02] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1002 is OK: PROCS OK: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [15:41:49] (03CR) 10Cathal Mooney: [C:03+2] Nokia: add support for SR-Linux v25 or v24 [homer/public] - 10https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) (owner: 10Cathal Mooney) [15:43:03] (03Merged) 10jenkins-bot: Nokia: add support for SR-Linux v25 or v24 [homer/public] - 10https://gerrit.wikimedia.org/r/1216869 (https://phabricator.wikimedia.org/T412157) (owner: 10Cathal Mooney) [15:49:35] (03CR) 10Dzahn: [C:04-1] "still not clear if there was coordination between the 2 requestors of things to happen to this domain ... now we have to wait for https://" [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [15:51:46] (03CR) 10Dzahn: [V:03+1 C:03+1] "at this point everything else (that is left) means:" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:53:08] (03CR) 10Dzahn: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:55:07] (03PS4) 10Dpogorzelski: ml-build: add hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1217184 [15:55:26] (03CR) 10Dpogorzelski: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1217184 (owner: 10Dpogorzelski) [15:56:12] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1216679 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [15:56:49] (03CR) 10Cathal Mooney: [C:03+1] Comment out temporarily the 
anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1216677 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [15:57:33] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217229 [15:58:27] (03CR) 10Elukey: [C:03+1] "left a note, feel free to proceed after it!" [puppet] - 10https://gerrit.wikimedia.org/r/1217184 (owner: 10Dpogorzelski) [16:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:10:18] (03PS5) 10Dpogorzelski: ml-build: add hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1217184 [16:10:39] (03PS1) 10Bking: opensearch on k8s: Add DC-specific records [dns] - 10https://gerrit.wikimedia.org/r/1217230 (https://phabricator.wikimedia.org/T410956) [16:10:47] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host logging-sd2006 [16:10:54] (03CR) 10Dpogorzelski: ml-build: add hieradata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1217184 (owner: 10Dpogorzelski) [16:10:55] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host logging-sd2007 [16:10:58] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-sd2006 [16:11:02] (03CR) 10Dpogorzelski: [C:03+2] ml-build: add hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1217184 (owner: 10Dpogorzelski) [16:11:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-sd2007 [16:11:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:14:56] jhancock@cumin1003 provision (PID 3264856) is awaiting input 
[16:19:17] (03CR) 10Gehel: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1217230 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [16:21:11] (03PS1) 10Bking: opensearch on k8s: add DC-specific endpoint domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217234 (https://phabricator.wikimedia.org/T410956) [16:22:46] (03CR) 10Dzahn: [C:04-1] "also on hold" [puppet] - 10https://gerrit.wikimedia.org/r/1216855 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:23:17] (03CR) 10Dzahn: "setting back to WIP - but you know it's here" [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [16:23:35] (03CR) 10Brouberol: [C:03+1] opensearch on k8s: Add DC-specific records [dns] - 10https://gerrit.wikimedia.org/r/1217230 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [16:24:42] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp5030 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:24:58] uh? what? [16:25:31] I am looking [16:25:42] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp5030 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:25:45] ??? 
[16:26:01] fabfur: it recovered [16:26:02] but looking [16:26:08] ack [16:26:15] (03CR) 10Bking: [C:03+2] opensearch on k8s: Add DC-specific records [dns] - 10https://gerrit.wikimedia.org/r/1217230 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [16:26:32] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.reimage for host ml-build1001.eqiad.wmnet with OS trixie [16:26:46] !log bking@dns1004 START - running authdns-update [16:27:44] !log bking@dns1004 END - running authdns-update [16:27:57] fabfur: I am guessing monitoring glitch [16:27:59] can't see anything [16:27:59] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:28:13] (03CR) 10Brouberol: [C:03+1] opensearch on k8s: add DC-specific endpoint domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217234 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [16:28:41] (03CR) 10Bking: [C:03+2] opensearch on k8s: add DC-specific endpoint domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217234 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [16:30:36] (03Merged) 10jenkins-bot: opensearch on k8s: add DC-specific endpoint domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217234 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [16:30:49] jouncebot: nowandnext [16:30:49] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [16:30:49] In 1 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1800) [16:30:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:31:13] sukhe: i see you were debugging an issue, double checking it's OK to make a deployment now? 
[16:31:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:32:21] urbanecm: yep, no issues, it recovered. please go ahead. [16:32:26] thanks! [16:32:41] (03CR) 10Urbanecm: [C:03+2] Confirmation email: further styling adjustments [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216867 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [16:32:59] (03PS1) 10Urbanecm: i18n: replace <> to avoid false positive export errors [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217235 [16:33:02] (03CR) 10Urbanecm: [C:03+2] i18n: replace <> to avoid false positive export errors [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217235 (owner: 10Urbanecm) [16:36:22] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:39:49] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:40:39] (03PS1) 10Gehel: WDQS: introduce a new role to test Blazegraph alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) [16:40:59] (03PS2) 10LorenMora: [Legal Footer] Deploy Legal Footer for Phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) [16:42:13] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host logging-sd2006.codfw.wmnet with OS bookworm [16:42:20] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Logging: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11448104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host logging-sd2006.codfw.wmnet with OS bookworm [16:42:29] (03PS2) 10Gehel: WDQS: 
introduce a new role to test Blazegraph alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) [16:42:41] (03CR) 10LorenMora: "Thank you. I added the comment. This is being released in phases starting with the highest legal risk wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) (owner: 10LorenMora) [16:42:47] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: 10Gehel) [16:43:21] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-build1001.eqiad.wmnet with reason: host reimage [16:44:43] (03CR) 10Ssingh: "1. I did a quick check and there isn't anything specific to Gerrit in VCL. I will do another pass." [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [16:46:01] (03CR) 10Btullis: [C:03+1] WDQS: introduce a new role to test Blazegraph alternatives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: 10Gehel) [16:46:16] (03CR) 10Btullis: WDQS: introduce a new role to test Blazegraph alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: 10Gehel) [16:47:23] PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [16:48:03] (03Merged) 10jenkins-bot: Confirmation email: further styling adjustments [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216867 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [16:48:08] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-build1001.eqiad.wmnet with reason: host reimage [16:48:14] RECOVERY - Docker registry HTTPS interface on 
registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Docker [16:48:21] (03Merged) 10jenkins-bot: i18n: replace <> to avoid false positive export errors [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217235 (owner: 10Urbanecm) [16:48:26] (03CR) 10Btullis: WDQS: introduce a new role to test Blazegraph alternatives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: 10Gehel) [16:49:12] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1216867|Confirmation email: further styling adjustments (T411526)]], [[gerrit:1217235|i18n: replace <> to avoid false positive export errors]] [16:49:15] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [16:53:29] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd2006.codfw.wmnet with reason: host reimage [16:59:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd2006.codfw.wmnet with reason: host reimage [17:02:25] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host logging-sd2005.codfw.wmnet with OS bookworm [17:02:34] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Logging: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11448220 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host logging-sd2005.codfw.wmnet with OS bookworm [17:02:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host logging-sd2007.codfw.wmnet with OS bookworm [17:02:46] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Logging: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11448221 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host logging-sd2007.codfw.wmnet 
with OS bookworm [17:03:24] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-build1001.eqiad.wmnet with OS trixie [17:03:51] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-build1001.eqiad.wmnet with OS trixie [17:13:23] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd2007.codfw.wmnet with reason: host reimage [17:16:11] (03CR) 10Bking: WDQS: introduce a new role to test Blazegraph alternatives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: 10Gehel) [17:16:35] (03PS1) 10Clare Ming: Test Kitchen UI: Deploying v1.1.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217244 (https://phabricator.wikimedia.org/T407805) [17:16:36] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:16:56] (03CR) 10Bking: WDQS: introduce a new role to test Blazegraph alternatives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: 10Gehel) [17:17:20] (03CR) 10Bking: [C:03+1] WDQS: introduce a new role to test Blazegraph alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: 10Gehel) [17:17:58] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd2007.codfw.wmnet with reason: host reimage [17:18:14] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:18:15] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd2006.codfw.wmnet with OS bookworm [17:18:22] 10ops-codfw, 06SRE, 
06DC-Ops, 10Observability-Logging: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11448268 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host logging-sd2006.codfw.wmnet with OS bookworm completed: - lo... [17:18:33] (03PS1) 10Clare Ming: Test Kitchen UI: Deploying v1.1.4 release to staging Test Kitchen UI: Deploying v1.1.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217246 (https://phabricator.wikimedia.org/T407805) [17:19:05] (03PS2) 10Clare Ming: Test Kitchen UI: Deploying v1.1.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217246 (https://phabricator.wikimedia.org/T407805) [17:35:19] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:38:23] jhancock@cumin1003 reimage (PID 3318763) is awaiting input [17:44:19] !log urbanecm@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.4,1.46.0-wmf.5,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/me [17:44:19] diawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediaw [17:44:19] iki-staging/scap/image-build' returned non-zero exit status 1. 
(scap version: 4.229.0) (duration: 55m 06s) [17:44:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd2005.codfw.wmnet with reason: host reimage [17:44:58] wat? [17:46:30] trying again... [17:46:44] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1216867|Confirmation email: further styling adjustments (T411526)]], [[gerrit:1217235|i18n: replace <> to avoid false positive export errors]] [17:46:47] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [17:48:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd2005.codfw.wmnet with reason: host reimage [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T1800) [18:06:00] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:09:04] jhancock@cumin1003 reimage (PID 3318276) is awaiting input [18:26:28] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:26:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd2007.codfw.wmnet with OS bookworm [18:26:59] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:27:00] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd2005.codfw.wmnet with OS bookworm [18:38:16] !log urbanecm@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 
--https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.4,1.46.0-wmf.5,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/me [18:38:16] diawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediaw [18:38:16] iki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 51m 32s) [18:39:46] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1216681 (owner: 10Ncmonitor) [18:39:50] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1216682 (owner: 10Ncmonitor) [18:40:35] hmm...still same error. reverting... 
[18:41:01] (03PS1) 10Urbanecm: Revert "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217258 (https://phabricator.wikimedia.org/T411526) [18:41:05] (03PS1) 10Urbanecm: Revert "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217259 (https://phabricator.wikimedia.org/T411526) [18:41:15] (03CR) 10Urbanecm: [V:03+2 C:03+2] Revert "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217258 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [18:41:15] (03CR) 10Urbanecm: [V:03+2 C:03+2] Revert "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217259 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [18:43:13] okay, reverted, will debug tomorrow [18:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:58:31] (03CR) 10Jdlrobson: [C:03+1] [Legal Footer] Deploy Legal Footer for Phase 1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) (owner: 10LorenMora) [19:02:48] (03PS1) 10CDobbins: WIP for depool alerting [alerts] - 10https://gerrit.wikimedia.org/r/1217262 [19:03:29] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1018.eqiad.wmnet [19:03:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1018.eqiad.wmnet [19:03:51] !log disable BGP on cr1-eqiad and cr2-eqiad to lvs1018 to fail over to lvs1020 (T411781) [19:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:54] T411781: lvs1018: remove cross-rack links to rows A, 
C and D - https://phabricator.wikimedia.org/T411781 [19:04:00] (03CR) 10CI reject: [V:04-1] WIP for depool alerting [alerts] - 10https://gerrit.wikimedia.org/r/1217262 (owner: 10CDobbins) [19:06:05] !log stop pybal/puppet on lvs1018 (T411781) [19:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:28] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:06:30] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:07:23] ^known [19:08:41] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: T411781 [19:13:53] (03PS2) 10Cathal Mooney: lvs1018: Remove vlan sub-interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215565 (https://phabricator.wikimedia.org/T411781) [19:14:00] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215565 (https://phabricator.wikimedia.org/T411781) (owner: 10Cathal Mooney) [19:14:07] !log urbanecm@deploy2002 Started scap sync-world: test [19:14:20] (03PS2) 10CDobbins: WIP for depool alerting [alerts] - 10https://gerrit.wikimedia.org/r/1217262 [19:14:52] (03CR) 10BCornwall: [C:03+1] lvs1018: Remove vlan sub-interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215565 (https://phabricator.wikimedia.org/T411781) (owner: 10Cathal Mooney) [19:15:32] (03CR) 10CI reject: [V:04-1] WIP for depool alerting [alerts] - 10https://gerrit.wikimedia.org/r/1217262 (owner: 10CDobbins) [19:17:56] (03CR) 10Cathal Mooney: [C:03+2] lvs1018: Remove vlan sub-interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215565 (https://phabricator.wikimedia.org/T411781) (owner: 10Cathal Mooney) [19:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST 
secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:21:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) (owner: 10LorenMora) [19:25:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:29:46] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1018.eqiad.wmnet with OS bullseye [19:42:35] (03CR) 10Ryan Kemper: [C:03+2] wdqs: correct deploy tag and add codfw as site [alerts] - 10https://gerrit.wikimedia.org/r/1216825 (https://phabricator.wikimedia.org/T389859) (owner: 10Ryan Kemper) [19:43:46] (03Merged) 10jenkins-bot: wdqs: correct deploy tag and add codfw as site [alerts] - 10https://gerrit.wikimedia.org/r/1216825 (https://phabricator.wikimedia.org/T389859) (owner: 10Ryan Kemper) [19:44:15] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1018.eqiad.wmnet with reason: host reimage [19:44:44] (03PS1) 10Eevans: data-gateway: move v1.0.14 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217267 (https://phabricator.wikimedia.org/T410962) [19:47:16] (03CR) 10Eevans: [C:03+2] data-gateway: move v1.0.14 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217267 (https://phabricator.wikimedia.org/T410962) (owner: 10Eevans) [19:48:44] !log 
brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: host reimage [19:49:02] (03Merged) 10jenkins-bot: data-gateway: move v1.0.14 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217267 (https://phabricator.wikimedia.org/T410962) (owner: 10Eevans) [19:51:57] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [19:52:15] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [19:52:56] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [19:53:27] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [19:55:09] (03PS1) 10BCornwall: wikimediafoundation.org: Add AAAA record [dns] - 10https://gerrit.wikimedia.org/r/1217268 (https://phabricator.wikimedia.org/T403269) [19:56:50] is the docker registry all right? my scap has been trying to push a new image for the last 40 minutes or so... 
[19:58:46] (03PS6) 10Ryan Kemper: sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [19:59:47] (03CR) 10Ryan Kemper: sre.data engineering cookbooks: use get_subset (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [20:00:16] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add availability sli recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [20:05:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1018.eqiad.wmnet with OS bullseye [20:09:58] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:10:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:10:36] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:11:48] (03CR) 10Cathal Mooney: [C:03+2] Eqiad C/D: Remove ESI-LAG config for Nokia connections to Juniper VCs [homer/public] - 10https://gerrit.wikimedia.org/r/1216802 (https://phabricator.wikimedia.org/T411781) (owner: 10Cathal Mooney) [20:13:39] !log Remove 2x40G LAGs between ssw1-d1-eqiad ssw1-d8-eqiad and asw2-c-eqiad asw2-d-eqiad [20:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:23] (03CR) 10Ssingh: [C:03+1] wikimediafoundation.org: Add AAAA record [dns] - 10https://gerrit.wikimedia.org/r/1217268 (https://phabricator.wikimedia.org/T403269) (owner: 10BCornwall) [20:17:59] !log 
jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:18:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:18:28] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:18:55] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1023.eqiad.wmnet with OS bullseye [20:19:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11448902 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye [20:22:59] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1024.eqiad.wmnet with OS bullseye [20:23:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11448916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host aqs1024.eqiad.wmnet with OS bullseye [20:23:14] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1026.eqiad.wmnet with OS bullseye [20:23:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11448917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host aqs1026.eqiad.wmnet with OS bullseye [20:23:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1027.eqiad.wmnet with OS bullseye [20:23:25] 06SRE, 10MW-on-K8s: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11448919 (10taavi) Nothing immediately obvious in the registry logs: ` Dec 
10 18:38:16 registry2005 docker-registry[608]: time="2025-12-10T18:38:16.115983796Z" level=error msg="er... [20:23:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11448920 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host aqs1027.eqiad.wmnet with OS bullseye [20:25:40] 06SRE, 10MW-on-K8s: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11448921 (10taavi) Grepping for the request ID shows this: ` Dec 10 18:38:16 registry2005 docker-registry[608]: time="2025-12-10T18:38:16.115838953Z" level=error msg="unknown erro... [20:26:26] 06SRE, 06Infrastructure-Foundations, 10netops: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11448922 (10cmooney) 05Open→03Resolved Folks, I am going to close this one for now. The mysterious issue has not reoccurred, and significantly we have now decommissioned the laye... 
[20:29:05] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1023.eqiad.wmnet with reason: host reimage [20:29:08] (03PS3) 10CDobbins: WIP for depool alerting [alerts] - 10https://gerrit.wikimedia.org/r/1217262 [20:29:38] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploying v1.1.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217244 (https://phabricator.wikimedia.org/T407805) (owner: 10Clare Ming) [20:29:45] (03PS1) 10Cathal Mooney: Nokia ESI-LAG: Adjust module to fully remove when last LAG deleted [homer/public] - 10https://gerrit.wikimedia.org/r/1217270 [20:30:38] !log urbanecm@deploy2002 Finished scap sync-world: test (duration: 76m 31s) [20:30:49] at least, something finished [20:31:13] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploying v1.1.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217244 (https://phabricator.wikimedia.org/T407805) (owner: 10Clare Ming) [20:31:49] !log [WDQS] `ryankemper@wdqs1014:~$ sudo systemctl restart wdqs-blazegraph` to unstick deadlock [20:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11448968 (10cmooney) All the ports are now decom'ed on the switches / servers. @Jclark-ctr when you are ready you can remove the three lin... 
[20:33:18] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1024.eqiad.wmnet with reason: host reimage [20:33:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1023.eqiad.wmnet with reason: host reimage [20:33:21] (03PS1) 10Ryan Kemper: Fix typo (s/Cound/Count) [alerts] - 10https://gerrit.wikimedia.org/r/1217272 [20:33:38] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1027.eqiad.wmnet with reason: host reimage [20:33:46] (03PS2) 10Ryan Kemper: Fix typo (s/Cound/Count) [alerts] - 10https://gerrit.wikimedia.org/r/1217272 (https://phabricator.wikimedia.org/T389859) [20:33:50] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1026.eqiad.wmnet with reason: host reimage [20:33:51] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11448969 (10cmooney) 05Open→03Resolved This is done, or at least we have all the major coverage we need. [20:34:30] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588#11448975 (10cmooney) 05Open→03Resolved [20:34:40] 06SRE, 10MW-on-K8s: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11448977 (10Urbanecm_WMF) Okay, so the final sync-world has finished: `lang=irc 21:30 <+logmsgbot> !log urbanecm@deploy2002 Finished scap sync-world: test (duration: 76m 31s) `... [20:35:02] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: Add Python modules to configure Nokia SR Linux switches - https://phabricator.wikimedia.org/T402577#11448978 (10cmooney) 05Open→03Resolved a:03cmooney There will be more work to refine the configuration and add elements over time, but closing th... 
[20:35:48] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11448982 (10cmooney) 05Open→03Resolved [20:36:06] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640#11448984 (10cmooney) 05Open→03Resolved [20:36:09] 06SRE, 10MW-on-K8s: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11448986 (10Urbanecm_WMF) Kept the (somehow-successful) scap-image-build-and-push-log at `deploy2002:/home/urbanecm/scap-image-build-and-push-log-T412265`, if it is useful for any... [20:36:09] (03CR) 10Bking: [C:03+2] Fix typo (s/Cound/Count) [alerts] - 10https://gerrit.wikimedia.org/r/1217272 (https://phabricator.wikimedia.org/T389859) (owner: 10Ryan Kemper) [20:37:18] (03Merged) 10jenkins-bot: Fix typo (s/Cound/Count) [alerts] - 10https://gerrit.wikimedia.org/r/1217272 (https://phabricator.wikimedia.org/T389859) (owner: 10Ryan Kemper) [20:37:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1024.eqiad.wmnet with reason: host reimage [20:40:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Nokia EX/QFX switches in eqiad rows C/D - https://phabricator.wikimedia.org/T412271 (10cmooney) 03NEW p:05Triage→03Medium [20:40:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Juniper EX/QFX switches in eqiad rows C/D - https://phabricator.wikimedia.org/T412271#11449020 (10cmooney) [20:41:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1027.eqiad.wmnet with reason: host reimage [20:42:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Juniper EX/QFX switches in eqiad rows C/D - https://phabricator.wikimedia.org/T412271#11449025 (10cmooney) 
[20:42:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11449026 (10cmooney) [20:44:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1026.eqiad.wmnet with reason: host reimage [20:47:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T410589)', diff saved to https://phabricator.wikimedia.org/P86506 and previous config saved to /var/cache/conftool/dbconfig/20251210-204712-ladsgroup.json [20:47:17] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [20:48:10] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:49:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:49:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1023.eqiad.wmnet with OS bullseye [20:49:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11449042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host aqs1023.eqiad.wmnet with OS bullseye completed: - aqs1023 (**PASS**) -... 
[20:52:04] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:55:09] jclark@cumin1003 reimage (PID 3385277) is awaiting input [20:57:23] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T2100). [21:00:05] sbassett, JSherman, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] o/ [21:00:18] o/ [21:00:27] jclark@cumin1003 reimage (PID 3385290) is awaiting input [21:00:32] !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:00:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1027.eqiad.wmnet with OS bullseye [21:00:34] Mine aren't testable until a later config change, and so can be bundled in with others. 
[21:00:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:00:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1024.eqiad.wmnet with OS bullseye [21:00:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11449052 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host aqs1027.eqiad.wmnet with OS bullseye completed: - aqs1027 (**WARN**) -... [21:00:43] rzl: taavi: I assume we should probably wait a bit until investigation follow? [21:00:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11449053 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host aqs1024.eqiad.wmnet with OS bullseye completed: - aqs1024 (**PASS**) -... 
[21:01:01] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:01:13] See T412265 for context [21:01:14] T412265: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265 [21:01:33] o/ [21:01:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:01:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1026.eqiad.wmnet with OS bullseye [21:02:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11449054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host aqs1026.eqiad.wmnet with OS bullseye completed: - aqs1026 (**PASS**) -... [21:02:10] urbanecm: nothing actively in progress -- I'd be inclined to say, try a regular deploy and see if it's still in a bad state, if so we can look into things like service restarts [21:02:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P86507 and previous config saved to /var/cache/conftool/dbconfig/20251210-210220-ladsgroup.json [21:02:34] Kemayo: happy to bundle yours with mine, which is pretty straightforward [21:02:47] rzl: regular as in, something else than the patches I tried earlier? 
[21:03:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:03:08] JSherman: that sounds fine to me [21:03:23] urbanecm: whatever the cause is, I would be surprised if it's sensitive to the content of the patch [21:03:42] Ack [21:03:51] rzl: urbanecm: are we good to start the backport window? [21:04:14] I'd say so! I'm unable to deploy now, but feel free to start [21:04:35] no concerns from me but note previous deployers saw some docker registry errors -- if you do too, I can see what I can find out [21:04:36] sbassett: can you self deploy, or would you like me to deploy for you? [21:04:47] I guess I can [21:04:50] rzl: ack; thanks! [21:05:00] I'm not a docker registry expert, I'm just the team member who isn't at an offsite right now :) [21:05:18] noted [21:05:19] How dare I file issues during offsites :)) [21:05:55] It sounds like it's all yours then, sbassett: [21:06:13] sure, is urbanecm still wrapping some things up? see some error states in spiderpig... [21:06:27] sbassett: those are the errors rzl mentioned [21:06:27] 👀 [21:06:35] Try yours and either it will work or it won't [21:06:42] heh, ok [21:06:46] Or the secret third thing. 
[21:06:51] I'm *definitely* not a spiderpig expert so if anything needs untangling there, it's a question for releng [21:06:57] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11449063 (10Jhancock.wm) [21:06:58] (03CR) 10SBassett: [C:03+1] Set CSP Report Only mode for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216660 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett) [21:07:10] (as it well might, if the previous deployment ended in a bad state) [21:07:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216660 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett) [21:07:31] running… [21:07:38] Fingers crossed! [21:07:58] there’s a chance my config patch might annoy logs/logstash, so that’s another thing I need to watch [21:08:04] (03Merged) 10jenkins-bot: Set CSP Report Only mode for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216660 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett) [21:08:25] !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1216660|Set CSP Report Only mode for group1 wikis (T291867)]] [21:12:19] !log sbassett@deploy2002 sbassett: Backport for [[gerrit:1216660|Set CSP Report Only mode for group1 wikis (T291867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:12:46] !log sbassett@deploy2002 sbassett: Continuing with sync [21:12:55] That looks promising [21:12:59] (03PS1) 10JHathaway: corto: add gdoc handover id [puppet] - 10https://gerrit.wikimedia.org/r/1217275 [21:13:13] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1217275 (owner: 10JHathaway) [21:14:49] canaries almost done [21:16:38] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [21:16:54] Kemayo: I had a look over your patches, and they seem like a pretty straightforward instrumentation change; happy to deploy. I'm thinking I may actually do them separate from mine. sbassett's seems to be working and it's a config patch, as is mine. We might be able to learn a little bit by seeing if your actual backport fails while the config patches work; and that may lower the risk for my patch :) [21:17:24] I'm fine either way. I'm also happy to deploy them myself, if they're not getting bundled up. [21:17:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P86508 and previous config saved to /var/cache/conftool/dbconfig/20251210-211728-ladsgroup.json [21:17:34] ah, very good! [21:17:50] Kemayo: yeah, I'll handoff in that case [21:17:53] The offer to bundle was entirely to speed up the deployment window. :D [21:18:42] ack; the odd docker errors are making me more cautious [21:18:59] !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216660|Set CSP Report Only mode for group1 wikis (T291867)]] (duration: 10m 34s) [21:19:27] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11449105 (10VRiley-WMF) [21:20:05] So… seems good? [21:20:07] Looks like sbassett's has finished, so I will get going with just mine. 
[21:20:11] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti-jumbo2001-3 to codfw - jhancock@cumin1003" [21:20:15] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti-jumbo2001-3 to codfw - jhancock@cumin1003" [21:20:15] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:18] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-jumbo2001 [21:20:19] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-jumbo2002 [21:20:20] Keymayo: may I go first? [21:20:21] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-jumbo2003 [21:20:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-jumbo2001 [21:20:29] Kemayo: may I go first? [21:20:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-jumbo2002 [21:20:32] JSherman: sure, go for it. 
[21:20:32] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-jumbo2003 [21:20:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216804 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [21:21:28] (03Merged) 10jenkins-bot: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216804 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [21:21:48] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1216804|Enable revertrisk filters in thwiki (T409438)]] [21:21:51] T409438: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438 [21:21:54] (03PS1) 10JHathaway: corto: dummy gdoc id [labs/private] - 10https://gerrit.wikimedia.org/r/1217276 [21:21:55] testing [21:22:31] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:34] (03CR) 10JHathaway: [C:03+2] corto: dummy gdoc id [labs/private] - 10https://gerrit.wikimedia.org/r/1217276 (owner: 10JHathaway) [21:22:38] (03CR) 10JHathaway: [V:03+2 C:03+2] corto: dummy gdoc id [labs/private] - 10https://gerrit.wikimedia.org/r/1217276 (owner: 10JHathaway) [21:23:18] (03PS2) 10JHathaway: corto: add gdoc handover id for POC [puppet] - 10https://gerrit.wikimedia.org/r/1217275 [21:23:36] oopsie, accidentally hit enter with my message queued up [21:23:45] !log jsn@deploy2002 kgraessle, jsn: Backport for [[gerrit:1216804|Enable revertrisk filters in thwiki (T409438)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:24:11] now testing [21:24:46] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1014.eqiad.wmnet with reason: catching up on lag [21:25:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:30] !log jsn@deploy2002 kgraessle, jsn: Continuing with sync [21:25:40] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:53] alrighty; looks good [21:27:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1217275 (owner: 10JHathaway) [21:28:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1217275 (owner: 10JHathaway) [21:30:39] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216804|Enable revertrisk filters in thwiki (T409438)]] (duration: 08m 51s) [21:30:43] T409438: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438 [21:30:43] Kemayo: all yours, thanks for your patience. Hopefully you don't hit the docker error after the image build! [21:31:43] JSherman: thanks, and we shall see... 
[21:31:51] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1216613 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[21:31:51] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1216883 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[21:31:58] ops-eqiad, SRE, DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11449172 (VRiley-WMF)
[21:32:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T410589)', diff saved to https://phabricator.wikimedia.org/P86509 and previous config saved to /var/cache/conftool/dbconfig/20251210-213235-ladsgroup.json
[21:32:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[21:33:36] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-jumbo2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:34:06] (CR) JHathaway: [C:+2] corto: add gdoc handover id for POC [puppet] - https://gerrit.wikimedia.org/r/1217275 (owner: JHathaway)
[21:35:24] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-jumbo2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:36:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-jumbo2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:38:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti-jumbo2001.codfw.wmnet with OS trixie
[21:38:35] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449214 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ganeti-jumb...
[21:38:40] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti-jumbo2002.codfw.wmnet with OS trixie
[21:38:48] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ganeti-jumb...
[21:38:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti-jumbo2003.codfw.wmnet with OS trixie
[21:39:01] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449217 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ganeti-jumb...
[21:39:10] (Merged) jenkins-bot: Add experiment + tracking for mobile section switching [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1216613 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[21:39:11] (Merged) jenkins-bot: mobileSectionSwitch: action_context needs to be stringified [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1216883 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[21:39:34] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1216613|Add experiment + tracking for mobile section switching (T410803)]], [[gerrit:1216883|mobileSectionSwitch: action_context needs to be stringified (T410803)]]
[21:39:38] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803
[21:41:14] RECOVERY - snapshot of s5 in codfw on backupmon1001 is OK: Last snapshot for s5 at codfw (db2201) taken on 2025-12-10 20:41:10 (464 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[21:41:14] hey, it's looking happy
[21:42:39] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1216613|Add experiment + tracking for mobile section switching (T410803)]], [[gerrit:1216883|mobileSectionSwitch: action_context needs to be stringified (T410803)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:44:03] !log kemayo@deploy2002 kemayo: Continuing with sync
[21:44:22] glad to see :) nothing unusual on the docker registry side either
[21:44:39] I'm overdue to get something to eat but I'll check back in a while if you need anything
[21:44:56] rzl: thanks for covering!
[21:45:04] Do eat!
[21:47:45] (PS1) Dzahn: aptrepo: remove expired key for certain HP repos [puppet] - https://gerrit.wikimedia.org/r/1217278
[21:49:15] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216613|Add experiment + tracking for mobile section switching (T410803)]], [[gerrit:1216883|mobileSectionSwitch: action_context needs to be stringified (T410803)]] (duration: 09m 40s)
[21:49:18] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803
[21:49:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo2002.codfw.wmnet with reason: host reimage
[21:49:29] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo2001.codfw.wmnet with reason: host reimage
[21:49:30] Yup, no obvious problems.
[21:49:39] 🎉
[21:50:08] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo2003.codfw.wmnet with reason: host reimage
[21:53:48] (CR) Dzahn: "2f2a30fc43e modules/aptrepo/files/updates (Filippo Giunchedi 2020-08-31 15:13:31 +0200 37) # Multiple keys in use: " [puppet] - https://gerrit.wikimedia.org/r/1217278 (owner: Dzahn)
[21:54:47] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo2002.codfw.wmnet with reason: host reimage
[21:58:26] (PS2) Dzahn: aptrepo: remove expired key for certain HP repos [puppet] - https://gerrit.wikimedia.org/r/1217278
[21:58:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo2003.codfw.wmnet with reason: host reimage
[22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T2200)
[22:01:37] (CR) Dzahn: "[expired: 2025-12-07] - gpg: key C208ADDE26C2B797: public key "Hewlett Packard Enterprise Company RSA-2048-25 " -" [puppet] - https://gerrit.wikimedia.org/r/1217278 (owner: Dzahn)
[22:02:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo2001.codfw.wmnet with reason: host reimage
[22:10:56] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[22:11:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[22:11:13] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo2002.codfw.wmnet with OS trixie
[22:11:22] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449455 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ganeti-jumbo200...
[22:13:51] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11449463 (Dzahn) The user should also be added to LDAP groups "nda" and "wmde" like other WMDE staff.
[22:14:48] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[22:15:36] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[22:15:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo2003.codfw.wmnet with OS trixie
[22:15:43] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ganeti-jumbo200...
[22:16:53] (CR) JHathaway: [C:+1] aptrepo: remove expired key for certain HP repos [puppet] - https://gerrit.wikimedia.org/r/1217278 (owner: Dzahn)
[22:19:30] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[22:20:02] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:21:36] (CR) Dzahn: [C:+2] aptrepo: remove expired key for certain HP repos [puppet] - https://gerrit.wikimedia.org/r/1217278 (owner: Dzahn)
[22:22:34] jhancock@cumin1003 reimage (PID 3400817) is awaiting input
[22:22:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[22:22:57] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo2001.codfw.wmnet with OS trixie
[22:23:08] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449472 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ganeti-jumbo200...
[22:24:54] (PS1) Pppery: Logos: Destandardize thumbnail sizes, handle missing responsive URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169)
[22:25:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:25:40] (CR) CI reject: [V:-1] Logos: Destandardize thumbnail sizes, handle missing responsive URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[22:25:48] (CR) Pppery: "Feel free to schedule this for deployment yourself if you rely on it to make a logo change." [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[22:26:34] ops-eqiad, SRE, DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11449490 (VRiley-WMF) E9 - E12 all have pingable IPs. E9 and E10 have been setup on their end, and need to add them to LibreNMS. Trying to log into E11 and E12 and they seem to be having issues t...
[22:28:23] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449500 (Jhancock.wm) Stalled→Resolved
[22:29:13] (PS2) Pppery: Logos: Destandardize thumbnail sizes, handle missing responsive URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169)
[22:29:53] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11449506 (Jhancock.wm) @bking these are ready. I didn't run into any issues with the reimage so it's now tested.
[22:40:02] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:45:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:46:10] (PS1) Ryan Kemper: QueryServiceHighThreadCount: incr threshold [alerts] - https://gerrit.wikimedia.org/r/1217303 (https://phabricator.wikimedia.org/T389859)
[22:48:52] (CR) Bking: [C:+2] QueryServiceHighThreadCount: incr threshold [alerts] - https://gerrit.wikimedia.org/r/1217303 (https://phabricator.wikimedia.org/T389859) (owner: Ryan Kemper)
[22:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:57:45] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[22:58:09] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[23:00:02] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T2300)
[23:01:11] (CR) Santiago Faci: [C:+2] Test Kitchen UI: Deploying v1.1.4 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1217246 (https://phabricator.wikimedia.org/T407805) (owner: Clare Ming)
[23:02:50] (Merged) jenkins-bot: Test Kitchen UI: Deploying v1.1.4 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1217246 (https://phabricator.wikimedia.org/T407805) (owner: Clare Ming)
[23:04:10] (CR) RLazarus: [C:+2] mathoid: Upgrade to envoy-future:1.35.7 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1216701 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[23:04:20] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[23:04:36] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[23:06:09] (Merged) jenkins-bot: mathoid: Upgrade to envoy-future:1.35.7 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1216701 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[23:07:54] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply
[23:08:11] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[23:10:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:25:54] ops-eqiad, SRE, DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11449704 (VRiley-WMF)
[23:26:34] ops-eqiad, SRE, DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11449715 (VRiley-WMF) Was able to get E10 setup with LibreNMS. Still working with the other 3
[23:27:51] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[23:28:23] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[23:29:38] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[23:30:05] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[23:32:10] ops-eqiad, SRE, DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11449724 (VRiley-WMF)
[23:34:54] jouncebot: nowandnext
[23:34:54] For the next 0 hour(s) and 25 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251210T2300)
[23:34:54] In 7 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T0700)
[23:34:54] In 7 hour(s) and 25 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T0700)
[23:35:12] borrowing mw-debug in codfw to test the envoy upgrade
[23:35:28] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[23:35:44] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[23:36:07] !log rzl@deploy2002:/srv/deployment-charts/helmfile.d/services/mw-debug$ helmfile -e codfw -i apply -l name=pinkunicorn --set mesh.image_name=envoy-future --set mesh.image_version=1.35.7-1 --context=5 # T410975
[23:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:10] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[23:40:26] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[23:40:45] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[23:40:59] !log rzl@deploy2002:/srv/deployment-charts/helmfile.d/services/mw-debug$ helmfile -e codfw -i apply -l name=pinkunicorn --context=5 # T410975
[23:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:04] done
[23:41:55] (CR) RLazarus: [C:+2] {api,rest}-gateway: Update staging to Envoy 1.35.7 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1216702 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[23:43:56] (Merged) jenkins-bot: {api,rest}-gateway: Update staging to Envoy 1.35.7 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1216702 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[23:44:18] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:44:45] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:44:47] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:45:07] SRE, Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11449778 (e75ti) Hi dear SRE folks :) //aspiring to join ranks \o/ :)// I had a look through related Wikitech pages on ops, server lifecycle and such; operations/puppet; Not...
[23:46:46] (PS1) E75ti: install_server: add Broadcom NIC UEFI check [puppet] - https://gerrit.wikimedia.org/r/1217340 (https://phabricator.wikimedia.org/T411374)
[23:46:49] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-sd1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:47:04] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[23:47:19] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[23:47:26] (CR) CI reject: [V:-1] install_server: add Broadcom NIC UEFI check [puppet] - https://gerrit.wikimedia.org/r/1217340 (https://phabricator.wikimedia.org/T411374) (owner: E75ti)
[23:47:44] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:47:52] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-sd1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:50:49] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[23:51:06] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply