[00:00:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:25] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy3002.esams.wmnet with OS trixie [00:06:37] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11316600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.es... 
[00:12:17] (03PS1) 10Zabe: Initial configuration for minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) [00:12:58] (03PS1) 10Zabe: Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408317) [00:13:23] (03PS2) 10Zabe: Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) [00:13:51] jouncebot: nowandnext [00:13:52] No deployments scheduled for the next 1 hour(s) and 46 minute(s) [00:13:52] In 1 hour(s) and 46 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0200) [00:13:55] (03CR) 10Zabe: [C:03+2] Initial configuration for minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [00:14:21] (03CR) 10Zabe: [C:03+2] Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [00:14:48] (03Merged) 10jenkins-bot: Initial configuration for minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [00:15:11] (03Merged) 10jenkins-bot: Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [00:16:44] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], [[gerrit:1199089|Initial configuration for minwikisource (T408317)]] [00:16:53] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [00:16:54] T408317: Create Wikisource 
Minangkabau - https://phabricator.wikimedia.org/T408317 [00:20:04] (03PS1) 10Zabe: Activate minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) [00:20:33] (03PS1) 10Zabe: Activate pcmwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) [00:24:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:56] (03PS6) 10Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) [00:34:02] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [00:37:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:39:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199093 [00:39:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199093 (owner: 10TrainBranchBot) [00:42:42] !log zabe@deploy2002 zabe: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], 
[[gerrit:1199089|Initial configuration for minwikisource (T408317)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:42:48] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [00:42:48] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [00:43:00] !log zabe@deploy2002 zabe: Continuing with sync [00:44:01] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 9.117 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:52:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:53:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.562 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:56:00] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199093 (owner: 10TrainBranchBot) [00:57:20] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], [[gerrit:1199089|Initial configuration for minwikisource (T408317)]] (duration: 40m 37s) [00:57:26] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [00:57:27] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [00:58:47] (03CR) 10Zabe: [C:03+2] Activate 
minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [00:59:21] (03CR) 10Zabe: [C:03+2] Activate pcmwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [00:59:40] (03Merged) 10jenkins-bot: Activate minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [01:00:09] (03Merged) 10jenkins-bot: Activate pcmwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [01:00:54] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:04:01] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199096 [01:04:01] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199096 (owner: 10Zabe) [01:04:55] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199096 (owner: 10Zabe) [01:08:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199097 [01:08:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199097 (owner: 10TrainBranchBot) [01:14:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:14] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 19s) [01:14:28] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate 
minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] [01:14:34] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [01:14:34] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [01:15:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:31] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408446#11316916 (10Jclark-ctr) →14Duplicate dup:03T408359 [01:16:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11316918 (10Jclark-ctr) [01:18:42] !log zabe@deploy2002 zabe: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[01:22:32] !log zabe@deploy2002 zabe: Continuing with sync [01:23:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:23:57] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 4.728 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:30:52] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199097 (owner: 10TrainBranchBot) [01:32:35] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] (duration: 18m 07s) [01:32:41] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [01:32:41] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [01:33:39] zabe, pcmwikisource?? 
[01:33:57] no worries [01:34:03] I know its pcmwikiquote [01:34:04] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:11] its just the commit message that is wrong [01:34:13] ah, ok, good :) [01:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:04] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:37] (03PS1) 10Andrew Bogott: rabbitmq: rename config file on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) [01:50:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [01:53:22] (03CR) 10Andrew Bogott: [C:03+2] rabbitmq: rename config file on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [01:54:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:55:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0200) [02:07:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) [02:07:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [02:17:59] PROBLEM - Host cloudrabbit2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [02:19:29] RECOVERY - Host cloudrabbit2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [02:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:23:43] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [02:24:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0300) [03:02:39] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) [03:02:41] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [03:03:33] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [03:04:01] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.25 refs T405681 [03:04:06] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [03:14:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:15:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:20:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11317298 (10Dzahn) [03:24:01] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request 
for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11317299 (10Dzahn) [03:29:06] (03PS1) 10Arlolra: ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) [03:30:28] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:37:53] (03PS1) 10C. Scott Ananian: Forward-compatibility: allow output flags to be serialized in `OutputFlags` [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) [03:38:26] (03CR) 10C. Scott Ananian: [C:03+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) (owner: 10C. Scott Ananian) [03:39:02] (03PS1) 10C. Scott Ananian: ParserOutput: Add deprecation warnings for ParserOutput::getLanguageLinks() [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199115 [03:39:12] (03CR) 10C. Scott Ananian: [C:03+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199115 (owner: 10C. Scott Ananian) [03:39:45] (03PS1) 10C. Scott Ananian: Implement a DOM version of the DeduplicateStyles pass [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) [03:39:56] (03CR) 10C. Scott Ananian: [C:03+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) (owner: 10C. 
Scott Ananian) [03:44:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:51:51] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.25 refs T405681 (duration: 47m 50s) [03:51:55] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [03:53:15] (03Merged) 10jenkins-bot: Forward-compatibility: allow output flags to be serialized in `OutputFlags` [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) (owner: 10C. Scott Ananian) [03:55:43] (03Merged) 10jenkins-bot: ParserOutput: Add deprecation warnings for ParserOutput::getLanguageLinks() [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199115 (owner: 10C. Scott Ananian) [03:55:47] (03Merged) 10jenkins-bot: Implement a DOM version of the DeduplicateStyles pass [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) (owner: 10C. Scott Ananian) [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0400) [04:02:40] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.22 (duration: 02m 38s) [04:29:08] (03PS1) 10C. 
Scott Ananian: ParserOutput: 'ParseUsedOptions' need not be present in serialized form [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 [04:29:49] (03CR) 10C. Scott Ananian: [C:03+2] "Pull late patch into the branch cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 (owner: 10C. Scott Ananian) [04:30:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:38:26] (03PS1) 10C. Scott Ananian: Expose the list of behavior switch magic words to Parsoid [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) [04:39:15] (03CR) 10C. Scott Ananian: [C:03+2] "Late patch onto the train" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) (owner: 10C. Scott Ananian) [04:43:39] (03Merged) 10jenkins-bot: ParserOutput: 'ParseUsedOptions' need not be present in serialized form [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 (owner: 10C. 
Scott Ananian) [04:45:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:49:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:38] (03Merged) 10jenkins-bot: Expose the list of behavior switch magic words to Parsoid [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) (owner: 10C. Scott Ananian) [04:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:57:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:04:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:04:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:05:53] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.587 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:09:04] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:15:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:18:53] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 1.421 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:34:04] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0600). [06:03:42] (03CR) 10Krinkle: ExtensionDistributor: Mark 1.45 as beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: 10Arlolra) [06:05:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:16:48] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510 (10Papaul) 03NEW [06:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:43:12] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511 (10Papaul) 03NEW [06:43:42] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11317386 (10Papaul) p:05Triage→03Medium [06:43:54] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 
10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11317387 (10Papaul) p:05Triage→03Medium [06:44:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis pcmwikiquote in section s5 [06:53:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis pcmwikiquote in section s5 [06:54:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:54:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis minwikisource in section s5 [06:55:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0700). nyaa~ [07:00:05] sefehpisikler: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:01:32] marostegui@cumin1003 sanitize-wiki (PID 343895) is awaiting input [07:10:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis minwikisource in section s5 [07:30:28] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:43:11] !log Deploy schema change on the master x1 T407587 [07:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:15] T407587: Apply ce_event_contributions schema changes in production (x1) - https://phabricator.wikimedia.org/T407587 [07:43:35] (03PS1) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1199225 [07:44:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [07:47:54] marostegui: I'd like to create database tables in x1 for two wikis for the above config patch, can you check the command I am going to run? 
[07:49:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:28] jouncebot: nowandnext [07:50:28] For the next 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0700) [07:50:28] In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000) [07:50:45] also, marostegui are you done deploying? [07:51:44] I'll take that as a "yes" [07:51:49] kostajh: Yeah, go for anything [07:51:53] You need :) [07:52:07] kostajh: Show me the command [07:52:52] marostegui: `php maintenance/mysql.php --cluster extension1 --wiki loginwiki ./extensions/CheckUser/schema/mysql/tables-virtual-checkuser-generated.sql` [07:53:41] kostajh: I guess that is correct I guess you'd run another one for metawiki [07:54:21] yeah [07:54:26] ok, I will try it [07:55:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:00] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11317482 (10cmooney) @papaul looks good! Nothing jumping out at me as problematic in terms of the connectivity plan. I don't think it makes sense to use 40G tho... [07:56:02] marostegui: hm, mwscript sql.php has a `--wiki` and a `--wikidb` flag [07:56:12] should I specify both as `loginwiki` ? 
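The exchange above proposes one table-creation command for loginwiki, with a second run for metawiki. A minimal sketch of how the two invocations could be assembled for review, assuming only the script path, flags, and schema file quoted in the chat; the helper name `build_create_tables_command` is hypothetical, and the log does not show which exact invocation was finally run (the `mwscript sql.php` form with `--wiki`/`--wikidb` is also mentioned). The sketch only prints the commands and does not execute anything.

```python
import shlex

# Values quoted from the chat; everything else here is illustrative.
SCHEMA_FILE = "./extensions/CheckUser/schema/mysql/tables-virtual-checkuser-generated.sql"
WIKIS = ["loginwiki", "metawiki"]


def build_create_tables_command(wiki):
    """Return the argv list for the command proposed in the chat, for one wiki.

    Hypothetical helper: it only assembles arguments and never runs them.
    """
    return [
        "php", "maintenance/mysql.php",
        "--cluster", "extension1",
        "--wiki", wiki,
        SCHEMA_FILE,
    ]


# One command per wiki, in the order discussed (loginwiki first, then metawiki).
commands = [build_create_tables_command(wiki) for wiki in WIKIS]

if __name__ == "__main__":
    # Dry run: print each command so it can be checked before anyone runs it.
    for argv in commands:
        print(shlex.join(argv))
```

Printing rather than executing mirrors what happens in the chat, where the command is shown to a DBA for a sanity check before it is run.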
[07:56:23] kostajh: I am not sure, I am not familiar with this procedure :( [07:56:27] just reading over `mwscript sql.php --help` [07:56:31] As we don't use it [07:56:39] (DBAs do not create tables in prod) [07:58:00] ok [07:58:10] it seems to have worked [07:58:41] I will deploy my config patch now [07:58:45] (03PS1) 10Brouberol: opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874) [07:59:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [07:59:20] (03PS2) 10Brouberol: opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874) [08:00:01] (03Merged) 10jenkins-bot: CheckUser: Enable SI on metawiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:01:04] (03CR) 10Slyngshede: [C:03+1] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1199225 (owner: 10Muehlenhoff) [08:02:10] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199026|CheckUser: Enable SI on metawiki and loginwiki (T408428)]] [08:02:15] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:02:40] (03CR) 10Kosta Harlan: "For next time: could you please schedule this as a backport? 
It was unexpected to see this when I went to deploy a config patch this morni" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 (owner: 10C. Scott Ananian) [08:02:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:04:16] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1199225 (owner: 10Muehlenhoff) [08:04:24] !log jmm@dns1004 START - running authdns-update [08:05:11] !log jmm@dns1004 END - running authdns-update [08:07:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:11:12] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001 [08:11:14] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.powercycle (exit_code=99) for host ml-serve2001 [08:13:13] !log restarting blazegraph on wdqs1019 - free allocator decreasing - `sudo depool; sleep 30; sudo systemctl restart wdqs-blazegraph.service; sleep 30; sudo pool` [08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:39] waiting on image building, which will probably take ~30 minutes [08:17:13] (03PS18) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [08:18:20] !log elukey@cumin2002 
START - Cookbook sre.hosts.powercycle for host ml-serve2001 [08:18:27] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.powercycle (exit_code=99) for host ml-serve2001 [08:19:22] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7480/co" [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [08:21:56] (03PS19) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [08:23:33] (03CR) 10Brouberol: [C:03+2] opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [08:23:56] (03CR) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [08:24:55] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [08:25:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:26:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:27:48] (03PS7) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [08:28:07] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199026|CheckUser: Enable SI on metawiki and loginwiki (T408428)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:28:12] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:28:38] !log installing openjdk-11 security updates [08:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:04] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:29:38] testing [08:29:55] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2001.codfw.wmnet [08:29:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2001.codfw.wmnet [08:33:09] !log kharlan@deploy2002 kharlan: Continuing with sync [08:34:06] (03PS1) 10Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) [08:34:53] (03PS1) 10Brouberol: opensearch-operator: add a separator between tenant role and rolebinding resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199230 (https://phabricator.wikimedia.org/T404874) [08:35:30] (03PS2) 10Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) [08:36:31] (03PS3) 10Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) [08:46:15] (03PS1) 10Kosta Harlan: hCaptcha: Enable on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) [08:49:07] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199026|CheckUser: 
Enable SI on metawiki and loginwiki (T408428)]] (duration: 46m 57s) [08:49:16] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:49:30] I'm going to sync another patch, unless someone else needs to deploy [08:49:36] jouncebot: nowandnext [08:49:36] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [08:49:36] In 1 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000) [08:50:13] (03CR) 10Mszwarc: [C:03+1] hCaptcha: Enable on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:50:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:51:21] (03PS3) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) [08:51:33] (03PS8) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [08:51:38] (03Merged) 10jenkins-bot: hCaptcha: Enable on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:52:06] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] [08:53:11] (03PS9) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [08:53:38] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001 [08:53:52] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve2001 [08:54:47] (03CR) 10DCausse: [C:03+1] cirrus: Start near 
match A/B test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [08:55:27] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:56:31] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:56:50] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:56:55] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [08:57:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:26] (03CR) 10Brouberol: [C:03+2] opensearch-operator: add a separator between tenant role and rolebinding resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199230 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [08:58:45] !log kharlan@deploy2002 kharlan: Continuing with sync [08:59:55] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: OpenJDK security updates - jmm@cumin2002 [08:59:58] (03PS1) 10Gehel: Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) [09:02:01] (03CR) 10CI reject: [V:04-1] Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - 
10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:02:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:05:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:06:59] (03CR) 10Clément Goubert: [C:03+1] Route /page/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [09:07:15] (03CR) 10Filippo Giunchedi: "> > Nice find! Yes I think that ought to work and cater for module unload too. And yes I think there shouldn't be too many modules." [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [09:08:40] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] (duration: 16m 35s) [09:08:45] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [09:14:40] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:14:44] (03CR) 10Brouberol: Hadoop: Introduce tmpreaper to cleanup /tmp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:15:50] gehel: FYI these days systemd-tmpfiles has replaced tmpreaper, check out e.g. modules/icinga/manifests/init.pp [09:20:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: OpenJDK security updates - jmm@cumin2002 [09:20:28] godog: Oh, nice! I'm too old school! 
[09:21:56] nice indeed, one line config file and you're done [09:22:41] (03CR) 10Elukey: [C:03+2] Use Thanos rules for Pyrra error metrics for xLab [puppet] - 10https://gerrit.wikimedia.org/r/1199023 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [09:29:06] (03Abandoned) 10Gehel: Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:30:52] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - 10https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) [09:30:56] (03CR) 10Elukey: LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [09:31:32] (03CR) 10Elukey: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [09:34:13] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll-restart for Java security updates - klausman@cumin1003 [09:36:43] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [09:36:45] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11317841 (10LSobanski) Untagging #collaboration-services based on https://phabricator.wikimedia.org/T403663#11196043 [09:37:12] (03PS1) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:38:07] (03CR) 10Stevemunene: LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [09:38:27] (03CR) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on 
testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [09:39:32] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:47] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [09:39:54] (03PS2) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:40:07] (03CR) 10Gehel: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:40:13] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:41:00] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar), 05WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532 (10LSobanski) 03NEW [09:41:29] (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - 10https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [09:41:49] (03PS1) 10Majavah: aptrepo: Retire kubeadm/1.29 components [puppet] - 10https://gerrit.wikimedia.org/r/1199240 [09:41:50] (03PS1) 10Majavah: aptrepo: Import Kubeadm/1.31 packages [puppet] - 10https://gerrit.wikimedia.org/r/1199241 (https://phabricator.wikimedia.org/T372697) [09:41:58] (03CR) 10CI reject: [V:04-1] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:42:05] (03CR) 10Majavah: [C:03+2] 
P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - 10https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [09:42:32] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:42:54] (03PS3) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:42:58] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet [09:43:07] (03CR) 10Gehel: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:43:20] (03CR) 10Brouberol: Hadoop: cleanup /tmp with systemd::tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:43:35] (03CR) 10Brouberol: Hadoop: cleanup /tmp with systemd::tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:43:42] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar), 05WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11317892 (10LSobanski) p:05Triage→03High [09:43:59] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:44:21] (03PS1) 10Jelto: aptrepo::staging: add job to clear incoming folder [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) [09:44:21] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar), 05WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11317895 (10LSobanski) [09:44:22] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy 
(gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11317894 (10LSobanski) [09:44:27] (03CR) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:45:01] (03Abandoned) 10Brouberol: growthbook: remove all traces of mongoDB from the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197589 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:45:30] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:45:48] (03CR) 10Stevemunene: [C:03+1] Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:46:25] (03CR) 10Stevemunene: [C:03+1] ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:48:52] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1014.eqiad.wmnet [09:49:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:13] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:49:16] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 
(https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:49:17] (03CR) 10Brouberol: [C:03+2] postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:49:24] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [09:49:25] (03CR) 10Brouberol: [C:03+2] Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:49:27] (03CR) 10Brouberol: [C:03+2] ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:50:11] (03PS4) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:50:18] (03CR) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:50:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:14] (03Merged) 10jenkins-bot: cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:51:28] (03Merged) 10jenkins-bot: cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) (owner: 
10Brouberol) [09:51:42] (03Merged) 10jenkins-bot: postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:51:52] (03Merged) 10jenkins-bot: Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:51:54] (03Merged) 10jenkins-bot: ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:51:57] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll-restart for Java security updates - klausman@cumin1003 [09:52:15] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Roll-restart for Java security updates - klausman@cumin1003 [09:53:20] (03CR) 10Mark Bergsma: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [09:54:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:05] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11317933 (10mark) Approved in Gerrit! 
[09:54:07] (03PS2) 10Tiziano Fogli: nrpe2nodexp: use service description as alertname [puppet] - 10https://gerrit.wikimedia.org/r/1199242 (https://phabricator.wikimedia.org/T395446) [09:54:18] looking at that alert [09:55:27] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [09:55:59] (03CR) 10Brouberol: [C:03+1] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:59:57] (03CR) 10Elukey: LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000) [10:01:34] (03CR) 10Stevemunene: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:02:53] (03CR) 10Clément Goubert: wikikube: Add wikikube-worker2[248-330] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [10:03:44] (03PS2) 10Jelto: aptrepo::staging: add job to clear incoming folder [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) [10:03:53] (03CR) 10Clément Goubert: [C:03+2] taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [10:05:20] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7482/co" [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [10:05:28] FIRING: [2x] 
SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:32] (03PS2) 10Daniel Kinzler: rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) [10:09:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:09:22] (03PS1) 10JavierMonton: Disable default user-agent collection. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) [10:09:37] FIRING: Failing Rate (Dashboard - Desktop & Mobile): - https://alerts.wikimedia.org/?q=alertname%3DFailing+Rate+%28Dashboard+-+Desktop+%26+Mobile%29 [10:10:00] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Roll-restart for Java security updates - klausman@cumin1003 [10:10:32] (03PS1) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:13:06] (03PS1) 10Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) [10:14:14] (03PS2) 10Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) [10:14:21] (03PS3) 10Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 
(https://phabricator.wikimedia.org/T376535) [10:14:25] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833#11318022 (10cmooney) 05Open→03Resolved I removed these additional sessions last week but got distracted and didn't come back to edi... [10:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:22:05] (03CR) 10Klausman: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [10:26:59] (03CR) 10Elukey: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:28:47] (03CR) 10Hnowlan: [C:03+1] Route /page/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [10:29:37] RESOLVED: Failing Rate (Dashboard - Desktop & Mobile): - https://alerts.wikimedia.org/?q=alertname%3DFailing+Rate+%28Dashboard+-+Desktop+%26+Mobile%29 [10:29:41] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group0 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198929 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:30:23] (03CR) 10Stevemunene: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:30:51] (03CR) 10Clément Goubert: [C:03+2] Route /page/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1199032 
(https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [10:32:14] (03CR) 10Fabfur: "as @Elukey correctly pointed out, the procedure needs to be followed here, happy to review it again later" [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:34:27] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [10:37:02] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318126 (10elukey) [10:37:46] (03CR) 10Dpogorzelski: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [10:38:01] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318132 (10elukey) We finally have all three SLO published in Pyrra: https://slo.wikimedia.org/?search=xlab Let's wait a couple of weeks to observe the new SL... [10:41:58] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group0 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198929 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:43:27] (03CR) 10Muehlenhoff: "That would work, alternative proposal inline (which doesn't interfere with people working late in the American timezones)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [10:44:32] (03PS1) 10Fabfur: P:cache:haproxy: don't repeat contact validation regex [puppet] - 10https://gerrit.wikimedia.org/r/1199251 (https://phabricator.wikimedia.org/T408060) [10:44:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [10:45:33] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198931 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:45:57] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:46:11] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198933 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:46:22] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198934 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:46:47] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:02] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198936 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:11] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198937 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:24] (03CR) 10Hnowlan: [C:03+1] 
trafficserver: action api to rest-gateway enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198938 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:50:03] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) [10:50:37] !log installing openjdk-17 security updates [10:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:07] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:51:17] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway enwiki 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:51:35] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:57:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:58:50] !log zabe@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [11:00:03] !log zabe@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [11:11:50] (03PS1) 10Stevemunene: druid: add druid-coordinator to druid public worker role [puppet] - 10https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) [11:14:51] (03CR) 10Mahmoud-abdelsattar: [C:03+1] Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 
(https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [11:14:54] (03PS2) 10Stevemunene: druid: add druid-coordinator to druid public worker role [puppet] - 10https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) [11:20:08] (03PS3) 10Stevemunene: LVS: etcd data for druid-public-coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) [11:20:12] (03PS4) 10Stevemunene: LVS: Add druid-public-coordinator to service list [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) [11:21:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [11:24:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:48] (03PS1) 10Muehlenhoff: osm: Remove obsolete spec files [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) [11:29:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:29:26] (03PS10) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [11:30:32] !log elukey@cumin2002 
START - Cookbook sre.hosts.powercycle for host ml-serve2001 [11:31:39] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:48] I'm going to do a deployment to private code, related to Suggested Investigations [11:32:03] (03CR) 10Elukey: [C:03+1] osm: Remove obsolete spec files [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:33:55] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [11:35:59] (03CR) 10Muehlenhoff: [C:03+2] osm: Remove obsolete spec files [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:37:33] (03PS1) 10Brouberol: cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) [11:37:56] (03PS1) 10Brouberol: postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) [11:40:35] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve2001 [11:41:07] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host sretest2010 [11:42:12] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408478#11318289 (10Jclark-ctr) [11:42:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11318292 (10Jclark-ctr) →14Duplicate dup:03T408478 [11:42:50] (03PS1) 10Mvolz: Update Zotero to node22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199263 (https://phabricator.wikimedia.org/T393434) [11:42:53] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2026.codfw.wmnet [11:42:53] !log 
elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010 [11:43:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11318295 (10Jclark-ctr) 05Duplicate→03Open Closed by mistake [11:44:07] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408478#11318299 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Down due to work with card install T400877 [11:44:34] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe-codfw [11:45:40] (03CR) 10Slyngshede: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [11:47:44] (03PS1) 10Muehlenhoff: osm_sync_lag.sh: Fix default to current directory [puppet] - 10https://gerrit.wikimedia.org/r/1199265 (https://phabricator.wikimedia.org/T381565) [11:47:57] (03CR) 10Stevemunene: [C:03+1] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:48:04] (03CR) 10Stevemunene: [C:03+1] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:48:52] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [11:49:06] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:49:08] (03CR) 10Brouberol: [C:03+2] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:49:19] (03PS2) 10Brouberol: postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) [11:50:43] (03CR) 10Brouberol: [V:03+2 C:03+2] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:50:47] (03CR) 10Brouberol: [V:03+2 C:03+2] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:54:33] (03PS2) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:54:35] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [11:54:36] fceratto@cumin1003 decommission (PID 372416) is awaiting input [11:59:27] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11318342 (10Neslihan_Turan_WMDE) Hi, sorry for the delay. I had a problem accessing Slack but now I managed to sent my public key to Amir. My public key is already... 
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1200) [12:00:36] Noting that I'll finish my deployment to private code in 2-3 minutes [12:01:16] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11318344 (10Jclark-ctr) @VRiley-WMF Hey, just a heads up — the fiber was installed with RX-to-RX and TX-to-TX, so the polarity wasn’t verified. Make sure to check polarity next time to avoid c... [12:04:38] !log Deployed changes to Suggested Investigations [12:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:44] I'm finished with deploying [12:08:08] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11318379 (10cmooney) >>! In T396065#11318344, @Jclark-ctr wrote: > @cmooney link is up Ok great yep BGP looking good I've added it now. ` cmooney@ssw1-e1-eqiad> show bgp summary group core |... 
[12:08:51] (03PS1) 10Muehlenhoff: maps: Stop installing osm2pgsql and osmborder [puppet] - 10https://gerrit.wikimedia.org/r/1199271 (https://phabricator.wikimedia.org/T381565) [12:09:14] (03PS1) 10Cathal Mooney: ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) [12:12:05] (03CR) 10Vgutierrez: [C:04-1] P:cache:haproxy: introduce ua classes (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [12:16:35] (03CR) 10Dpogorzelski: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [12:19:43] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [12:19:57] (03CR) 10Cathal Mooney: [C:03+2] ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:21:15] (03Merged) 10jenkins-bot: ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:24:09] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [12:26:28] Msz2001: is deploying a follow up [12:27:14] fceratto@cumin1003 decommission (PID 372416) is awaiting input [12:27:27] these issues appeared after the previous deploy https://logstash.wikimedia.org/goto/d13b6c9cd8e42929d855b4c081e43484 [12:35:20] Deployed [12:44:45] (03PS1) 10Stevemunene: druid: Increase the size of the Druid broker cache size to 4GB 
[puppet] - 10https://gerrit.wikimedia.org/r/1199280 (https://phabricator.wikimedia.org/T408189) [12:45:22] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: reboot [12:46:03] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet [12:49:18] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11318475 (10Jclark-ctr) a:05Jclark-ctr→03None [12:49:48] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet [12:53:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [12:53:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:08] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2026.codfw.wmnet [12:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:00:05] Urbanecm and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1300). [13:00:06] Bunnypranav and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe-codfw [13:01:15] hi [13:03:07] anyone deploying? [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:09] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs2011.codfw.wmnet [13:06:13] (03PS5) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [13:07:15] (03PS2) 10Muehlenhoff: Shift tile eqiad invalidation to the bookworm master [puppet] - 10https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565) [13:08:08] (03CR) 10CDanis: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [13:08:23] (03CR) 10Gehel: [C:03+2] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [13:10:29] (03Abandoned) 10Muehlenhoff: Shift tile eqiad invalidation to the bookworm master [puppet] - 10https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:11:13] (03CR) 10Muehlenhoff: "The mwdebug servers are gone" [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [13:11:20] (03PS2) 10Muehlenhoff: Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) [13:14:04] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 
'apply'. [13:14:54] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:17:38] MatmaRex, I can help if you'll assist with testing :) [13:17:46] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host lvs2011.codfw.wmnet [13:17:50] Are you still around? [13:17:58] hi :) thanks [13:18:28] 10ops-codfw, 06DC-Ops, 06Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549 (10ssingh) 03NEW [13:18:29] Seems like Bunnypranav is not around [13:18:36] 10ops-codfw, 06DC-Ops, 06Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549#11318574 (10ssingh) p:05Triage→03High [13:18:37] So I'll just quickly do MatmaRex's [13:18:50] Hi! [13:19:07] Bit late, apologies. I'm fine with waiting [13:19:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) (owner: 10Bartosz Dziewoński) [13:20:03] bunnypranav, okay! Will signal you once I'm done, thanks! 
[13:20:13] Sure :) [13:20:39] (03Merged) 10jenkins-bot: Make wgVectorMaxWidthOptions specify Special:Userlogin correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) (owner: 10Bartosz Dziewoński) [13:21:13] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] [13:21:19] T408447: Under Vector 2022 on Wikimedia wikis, page width is different between Special:UserLogin and Special:CreateAccount - https://phabricator.wikimedia.org/T408447 [13:23:23] (03PS1) 10Mszwarc: Remove hCaptcha site key from private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 [13:23:50] (03CR) 10Kosta Harlan: [C:03+1] Remove hCaptcha site key from private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 (owner: 10Mszwarc) [13:24:14] xSavitar MatmaRex we need to sync the above patch ^ [13:24:15] (03PS14) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [13:25:04] are either of you able to sync that? it should be a no-op. if not, either me or Msz2001 can do it [13:25:08] !log derick@deploy2002 derick, matmarex: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:25:12] kostajh, sure! After bunnypranav or now? [13:25:25] MatmaRex, you can test [13:25:26] as soon as possible, I'd say [13:25:49] my change looks good [13:25:53] Okay, once MatmaRex is done testing, maybe you can take over before bunnypranav (just an idea). That is if bunnypranav is up for it. [13:26:05] MatmaRex, okay will sync now. [13:26:06] I'm fine, can wait if needed. 
[13:26:12] !log derick@deploy2002 derick, matmarex: Continuing with sync [13:26:38] kostajh, okay bunnypranav agrees. I'll poke you once MatmaRex's patch is done syncing. [13:27:39] kostajh, I can also help in doing it. [13:28:22] (03CR) 10Ottomata: Disable default user-agent collection. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [13:29:02] thank you! [13:29:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:46] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:49] (03PS15) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [13:29:49] (03CR) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [13:32:10] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] (duration: 10m 56s) [13:32:14] T408447: Under Vector 2022 on Wikimedia wikis, page width is different between Special:UserLogin and Special:CreateAccount - https://phabricator.wikimedia.org/T408447 [13:33:05] (03CR) 10Muehlenhoff: "Looks good to me!" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [13:33:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 (owner: 10Mszwarc) [13:33:36] kostajh, so nothing to test I suppose? [13:33:45] xSavitar: nothing to test [13:33:57] Ack! Will just sync it when it's time then, thanks~ [13:34:01] *! [13:34:16] (03Merged) 10jenkins-bot: Remove hCaptcha site key from private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 (owner: 10Mszwarc) [13:34:48] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] [13:35:35] thanks for deploying xSavitar [13:35:59] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11318699 (10LSobanski) [13:36:22] MatmaRex, thank you :) [13:38:53] !log derick@deploy2002 mszwarc, derick: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:39:16] !log derick@deploy2002 mszwarc, derick: Continuing with sync [13:39:42] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11318700 (10Papaul) @cmooney thanks for the feedback, I will upgrade the diagram to match the 100G links between the core routers and the switches and the type of... [13:42:43] bunnypranav, 64% done, will hand over to you in a few mins. [13:42:56] sure! [13:43:46] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] (duration: 08m 58s) [13:43:55] bunnypranav over to you. 
[13:44:18] and thank you for your patience. 🙏🏽 [13:44:27] No worries [13:45:21] I need some help from you as well: the patch creates a namespace; do we need to run any maintenance scripts? [13:46:17] btw, the namespace is "R:", and they already use that prefix, technically in the mainspace, so I assume the former. [13:46:25] xSavitar: ^^^ [13:46:38] bunnypranav: run namespacedupes [13:46:49] anzx beat me to it. [13:47:23] I assume the pages won't be lost, right? [13:49:30] (03PS2) 10JavierMonton: Disable default user-agent collection. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) [13:49:32] bunnypranav, I think everything should be fine. [13:49:36] bunnypranav: https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php add the prefix to check whether any pages that were lost, unmoved, or need to be moved manually can be retrieved [13:49:53] Are there any pages that are already in that namespace? In the past? [13:50:12] I guess I shouldn't say namespace but prefixed by R: [13:50:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:37] After running that script, everything should work correctly and they should be part of the R: and R_talk: namespace I suppose. [13:51:14] Okay! [13:51:19] * xSavitar runs for a meeting... [13:51:28] xSavitar: BTW I need you to deploy it for me, I am just a volunteer. [13:51:57] (03CR) 10Giuseppe Lavagetto: "I think the patch goes in the right direction, but is overcomplicated and misses a couple things:" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [13:52:12] bunnypranav, Oh, I could do that, but I have a meeting now. Will you be fine doing the next backport window?
That is if another deployer isn't around to help. [13:52:15] (03CR) 10JavierMonton: Disable default user-agent collection. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [13:52:27] I thought you would be the one deploying, apologies, I would have asked. [13:52:31] The next window is 1:30 am for me [13:52:49] It's fine [13:53:22] Oops :(, I'll ping you here in a few hours (later this evening). If there is an open window, we can deploy your patch. [13:53:39] Otherwise, we can do it tomorrow afternoon (that's when I'll be available). [13:53:54] Is that okay by you? [13:54:21] (03CR) 10Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [13:54:28] Fine, I'll see if I am available tomorrow. [13:54:46] These deploy windows are pretty tough for Asian timezones [13:55:10] bunnypranav, FYI - this is the docs for adding a new namespace: https://wikitech.wikimedia.org/wiki/Adding_namespaces [13:55:15] I hope it's still up to date. [13:55:19] Can I ping you in a few hours once I am available as well? [13:55:34] bunnypranav, yes ping me please. I want to help. [13:55:48] Thank you so much! [13:56:01] bunnypranav, no, thank you for all the work. 🙏🏽 [13:56:12] :D [13:56:31] Re tz friendliness, maybe you can ask on #wikimedia-releng about it. [13:56:52] But we have multiple of these windows per day, so I'm pretty sure one of them is friendly to your TZ [13:57:11] * xSavitar goes AFK to attend a meeting. [13:57:28] Checked the wikitech page earlier, commit is fine; just needed confirmation on the maintenance scripts [13:58:07] yeah, the afternoon one was fine, today I was busy for the morning one, so couldn't schedule for it.
[14:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1400) [14:01:51] (03CR) 10Elukey: [C:03+1] osm_sync_lag.sh: Fix default to current directory [puppet] - 10https://gerrit.wikimedia.org/r/1199265 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:02:12] (03CR) 10Elukey: [C:03+1] maps: Stop installing osm2pgsql and osmborder [puppet] - 10https://gerrit.wikimedia.org/r/1199271 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:02:41] (03CR) 10Elukey: [C:03+1] LVS: etcd data for druid-public-coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:02:58] (03CR) 10Elukey: [C:03+1] LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:03:12] (03CR) 10Elukey: [C:03+1] druid: add druid-coordinator to druid public worker role [puppet] - 10https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:05:32] (03PS16) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [14:05:51] (03PS1) 10Brouberol: global_config: add an urldownloader external service [puppet] - 10https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012) [14:09:58] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012) (owner: 10Brouberol) [14:10:46] (03PS5) 10Daniel Kinzler: api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) [14:10:58] (03CR) 10CI reject: 
[V:04-1] api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [14:13:00] (03PS1) 10Federico Ceratto: sanitize-wiki: log into phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) [14:14:10] (03PS1) 10Muehlenhoff: Update account meta data for khantstop [puppet] - 10https://gerrit.wikimedia.org/r/1199302 [14:14:48] (03CR) 10Ottomata: [C:03+1] "I didn't look very deep to check each config, but LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [14:17:08] (03PS17) 10Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [14:19:50] (03PS18) 10Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [14:19:50] (03PS7) 10Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [14:20:06] (03PS8) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [14:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:21:24] (03CR) 10Kamila Součková: [C:03+2] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 
(https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [14:23:12] (03PS7) 10Clément Goubert: api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [14:23:15] (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) (owner: 10Santiago Faci) [14:24:26] (03CR) 10Jasmine: wikikube: Add wikikube-worker2[248-330] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [14:24:53] (03Merged) 10jenkins-bot: xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) (owner: 10Santiago Faci) [14:26:09] (03PS1) 10Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) [14:26:39] (03CR) 10Andrew Bogott: [C:03+1] clean-stale-puppet-certs: Remove nodes from PuppetDB where enabled [puppet] - 10https://gerrit.wikimedia.org/r/1198299 (owner: 10Majavah) [14:27:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549#11318894 (10Jhancock.wm) logged into idrac and found following error. ` A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in... 
[14:27:56] (03CR) 10Kamila Součková: [C:03+1] "LGTM :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [14:28:19] (03CR) 10CI reject: [V:04-1] toolforge::toolviews: Output proper Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [14:28:33] (03PS20) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [14:29:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11318896 (10Raine) [14:29:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1430) [14:30:18] (03PS19) 10Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [14:30:58] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11318918 (10tappof) I saw the alerts on the ALERTS metric: https://w.wiki/FqSi . I think there was a silence rule in place, so you didn't get any notifications.... [14:31:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549#11318932 (10ssingh) 05Open→03Resolved a:03ssingh Thanks for the help @Jhancock.wm. 
Marking this as resolved for now. [14:32:33] (03PS9) 10Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [14:33:26] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318939 (10dr0ptp4kt) >>! In T398869#11318126, @elukey wrote: > We finally have all three SLO published in Pyrra: https://slo.wikimedia.org/?search=xlab Thank... [14:33:50] (03PS9) 10Clément Goubert: api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [14:35:17] (03CR) 10CI reject: [V:04-1] api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [14:36:02] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [14:37:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11318965 (10elukey) Ran the diff testing tool between eqiad and codfw: ` | | ssim | |-----:|---------:| | 0.05 | 0.974994 | | 0.1 | 0.990161 | | 0.2 | 0.998943 | | 0.25 |... 
[14:37:46] (03PS1) 10Brouberol: growthbook: deploy a more modern version against ferretdb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199310 (https://phabricator.wikimedia.org/T408397) [14:39:48] (03PS1) 10Federico Ceratto: site.pp, es2026.yaml: Decommission es2026 [puppet] - 10https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385) [14:40:48] jouncebot: nowandnext [14:40:48] For the next 0 hour(s) and 19 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1430) [14:40:48] In 0 hour(s) and 19 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1500) [14:41:36] I am restarting both CI Jenkins and Gerrit [14:42:07] !log Restarting Gerrit [14:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:46] (03CR) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [14:45:08] !log Restarted CI Jenkins [14:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:44] (03CR) 10Majavah: [C:03+2] clean-stale-puppet-certs: Remove nodes from PuppetDB where enabled [puppet] - 10https://gerrit.wikimedia.org/r/1198299 (owner: 10Majavah) [14:45:45] Gerrit/Jenkins/Zuul are all up and running [14:46:02] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 9.007 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:46:34] (03CR) 10Andrea Denisse: [C:03+1] "lgtm, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [14:46:59] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11319051 (10cmooney) >>! In T408378#11318918, @tappof wrote: > I saw the alerts on the ALERTS metric: https://w.wiki/FqSi . Ok thanks for that! That is a good... [14:47:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11319065 (10RobH) [14:47:42] (03PS1) 10Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) [14:48:29] (03PS2) 10Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) [14:49:22] (03CR) 10Clément Goubert: "Due to rebasing issues, I've squashed all the patch stack for the next phase of testing in one, plus renaming group to policy." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [14:49:56] (03CR) 10CI reject: [V:04-1] api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [14:50:02] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [14:50:54] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30036 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:51:04] (03PS3) 10Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) [14:51:36] (03PS1) 10Cathal Mooney: team-netops: ospf alert: add pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1199332 (https://phabricator.wikimedia.org/T408378) [14:52:06] (03CR) 10Pmiazga: api-gateway: Release patch for ratelimit test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [14:52:32] (03CR) 10CI reject: [V:04-1] api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [14:52:33] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:52:37] (03PS2) 10Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) [14:52:37] (03PS1) 10Majavah: toolforge::toolviews: Fix footgun with default values [puppet] - 
10https://gerrit.wikimedia.org/r/1199333 [14:54:03] (03PS1) 10Gehel: hadoop: cleanup /tmp from directories as well as files [puppet] - 10https://gerrit.wikimedia.org/r/1199334 (https://phabricator.wikimedia.org/T396582) [14:55:01] (03PS3) 10Cwhite: site: initial setup for new logging-sd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) [14:55:07] (03CR) 10CI reject: [V:04-1] toolforge::toolviews: Output proper Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [14:56:33] (03PS4) 10Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) [14:57:38] !log dancy@deploy2002 Installing scap version "4.218.0" for 2 host(s) [14:57:57] (03CR) 10Clément Goubert: api-gateway: Release patch for ratelimit test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [14:57:58] (03CR) 10CI reject: [V:04-1] api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [14:58:11] (03CR) 10FNegri: [C:03+1] toolforge::toolviews: Fix footgun with default values [puppet] - 10https://gerrit.wikimedia.org/r/1199333 (owner: 10Majavah) [14:59:11] (03PS2) 10Majavah: toolforge::toolviews: Fix footgun with default values [puppet] - 10https://gerrit.wikimedia.org/r/1199333 [14:59:24] !log dancy@deploy2002 Installation of scap version "4.218.0" completed for 2 hosts [14:59:59] 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11319159 (10Raine) 05Open→03Resolved Done, ping me in case of trouble :-) [15:00:05] jelto, arnoldokoth, and mutante: It is that lovely time of the 
day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1500). [15:00:31] no my calendar says it's in one hour [15:00:41] daylight confusion time [15:01:05] (03PS21) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [15:01:46] (03CR) 10Majavah: [C:03+2] toolforge::toolviews: Fix footgun with default values [puppet] - 10https://gerrit.wikimedia.org/r/1199333 (owner: 10Majavah) [15:02:31] (03PS3) 10Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) [15:04:14] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences, 06Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11319191 (10Jdrewniak) When I talked to #traffic about this topic... 
[15:04:35] (03CR) 10CI reject: [V:04-1] toolforge::toolviews: Output proper Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [15:05:42] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: reboot for kernel [15:06:00] (03PS4) 10Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) [15:06:19] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab2002.codfw.wmnet with reason: reboot for kernel [15:06:34] (03PS22) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [15:07:20] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11319213 (10elukey) @TheDJ Hi! 
As FYI we now have eqiad and codfw on the new stack, both eqiad and codfw are pooled :) [15:07:23] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7486/co" [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [15:09:04] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:07] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:09:11] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:09:19] !log brennen@deploy2002 Started deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575 [15:09:21] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [15:09:24] T408575: Deploy Phabricator/Phorge 2025-10-28 - https://phabricator.wikimedia.org/T408575 [15:09:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [15:09:53] !log brennen@deploy2002 Finished deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575 (duration: 00m 34s) [15:10:12] !log brennen@deploy2002 Started deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575 [15:11:37] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319244 (10elukey) [15:11:41] !log applied mediawiki-common network policy updates in mw-script / mw-cron - T309738 [15:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:52] T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 [15:12:13] 06SRE, 10SRE-Access-Requests, 
13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11319246 (10Ladsgroup) >>! In T406590#11318342, @Neslihan_Turan_WMDE wrote: > Hi, sorry for the delay. I had a problem accessing Slack but now I managed to sent my... [15:12:22] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319258 (10elukey) The host is up after a powercycle, but it is still not serving any traffic. Adding dcops if they want to investigate it further, giving the numerous occurrences of t... [15:13:24] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11319262 (10Ladsgroup) I confirmed the key out of band. [15:13:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11319266 (10Ladsgroup) [15:14:01] (03PS1) 10Ottomata: AQS edit-analytics - deploy new edits/per_editor endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199337 (https://phabricator.wikimedia.org/T405041) [15:16:21] !log brennen@deploy2002 Finished deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575 (duration: 06m 09s) [15:16:33] T408575: Deploy Phabricator/Phorge 2025-10-28 - https://phabricator.wikimedia.org/T408575 [15:16:56] (03PS5) 10Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) [15:19:26] (03CR) 10Elukey: [C:03+1] Nokia: always set system cpm packet filter on devices [homer/public] - 10https://gerrit.wikimedia.org/r/1199056 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [15:20:05] (03CR) 10Brouberol: [C:03+1] druid: Increase the size of the Druid broker cache size to 4GB [puppet] - 10https://gerrit.wikimedia.org/r/1199280 
(https://phabricator.wikimedia.org/T408189) (owner: 10Stevemunene) [15:21:39] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319327 (10Jhancock.wm) @elukey is it depooled? i wanna check some things out that might require some reboots. [15:23:36] !log disable-puppet on A:cp hosts for haproxy config change [15:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:02] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [15:24:02] (03CR) 10Stevemunene: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199310 (https://phabricator.wikimedia.org/T408397) (owner: 10Brouberol) [15:24:16] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319338 (10elukey) @Jhancock.wm yep you can go ahead! Thanks :) [15:24:33] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:24:36] (03CR) 10Scott French: [C:03+2] P:cache::haproxy: move x_requestctl setup into listen section [puppet] - 10https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:24:56] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 2.732 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [15:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:27:27] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11319349 (10MoritzMuehlenhoff) [15:27:29] (03PS6) 10Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) [15:27:55] (03Abandoned) 10Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [15:28:05] (03Abandoned) 10Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [15:28:11] (03Abandoned) 10Clément Goubert: api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [15:29:37] (03CR) 10CDanis: [C:03+1] "+1 from me! 
Although I don't think it's strictly necessary to make the same change on the public druid IMO" [puppet] - 10https://gerrit.wikimedia.org/r/1199280 (https://phabricator.wikimedia.org/T408189) (owner: 10Stevemunene) [15:34:04] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:52] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ml-serve2001'] [15:34:53] (03PS2) 10Arlolra: ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) [15:35:13] (03CR) 10Herron: [C:03+1] alertmanager: Add support for team mentions on the Slack template [puppet] - 10https://gerrit.wikimedia.org/r/1194321 (https://phabricator.wikimedia.org/T408145) (owner: 10Andrea Denisse) [15:36:36] (03CR) 10Ottomata: [C:03+2] "Main patch has been reviewed, merging for deployment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199337 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [15:36:49] (03CR) 10Herron: [C:03+1] nrpe2nodexp: use service description as alertname [puppet] - 10https://gerrit.wikimedia.org/r/1199242 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [15:37:56] (03CR) 10Arlolra: ExtensionDistributor: Mark 1.45 as beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: 10Arlolra) [15:38:23] (03Merged) 10jenkins-bot: AQS edit-analytics - deploy new edits/per_editor endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199337 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [15:41:52] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:43:49] !log rolling run-puppet-agent on A:cp hosts for haproxy config change [15:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:52] (03PS1) 10Kamila Součková: benthos-cache-invalidator: clean up releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199340 [15:44:55] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ml-serve2001'] [15:46:22] 10ops-eqiad, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408585 (10phaultfinder) 03NEW [15:46:39] (03CR) 10CI reject: [V:04-1] benthos-cache-invalidator: clean up releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199340 (owner: 10Kamila Součková) [15:49:29] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199403 [15:51:12] (03PS3) 10Ebernhardson: cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) [15:51:12] (03CR) 10Ebernhardson: cirrus: Start near 
match A/B test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [15:51:58] (03CR) 10CI reject: [V:04-1] cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [15:54:09] FIRING: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:54:31] (03PS23) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [15:58:29] (03CR) 10Cathal Mooney: [C:03+2] Nokia: always set system cpm packet filter on devices [homer/public] - 10https://gerrit.wikimedia.org/r/1199056 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [15:59:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [15:59:17] (03PS4) 10Ebernhardson: cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) [16:00:00] (03Merged) 10jenkins-bot: Nokia: always set system cpm packet filter on devices [homer/public] - 10https://gerrit.wikimedia.org/r/1199056 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [16:00:04] jhathaway and moritzm: Time to snap out of that daydream and deploy Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:04:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [16:05:20] 10ops-magru, 06SRE, 06DC-Ops: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589 (10RobH) 03NEW p:05Triage→03Low [16:05:50] 10ops-magru, 06SRE, 06DC-Ops: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589#11319581 (10RobH) Please note the email required we give consent for the work so I did so via the email. [16:06:52] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589#11319592 (10RobH) @netops & #traffic: I don't expect any impact from this according to the notification but just FYI! 
[16:13:53] 06SRE, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592 (10Jdrewniak) 03NEW [16:14:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:23] 06SRE, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11319644 (10Jdrewniak) [16:15:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:08] (03CR) 10Marostegui: sanitize-wiki: log into phabricator (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) (owner: 10Federico Ceratto) [16:30:56] (03PS1) 10Marostegui: instances.yaml: Remove es1031 [puppet] - 10https://gerrit.wikimedia.org/r/1199462 (https://phabricator.wikimedia.org/T408600) [16:31:37] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1031 [puppet] - 10https://gerrit.wikimedia.org/r/1199462 (https://phabricator.wikimedia.org/T408600) (owner: 10Marostegui) [16:32:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1031 from dbctl T408600', diff saved to https://phabricator.wikimedia.org/P84315 and previous config saved to /var/cache/conftool/dbconfig/20251028-163252-marostegui.json [16:32:59] T408600: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600 [16:34:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:18] (03PS1) 10Marostegui: mariadb: Decommission es1031 [puppet] - 10https://gerrit.wikimedia.org/r/1199463 (https://phabricator.wikimedia.org/T408600) [16:34:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1031.eqiad.wmnet [16:35:08] (03PS1) 10Ottomata: edit-analytics - bump to build on bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199464 (https://phabricator.wikimedia.org/T405041) [16:35:14] (03PS1) 10Elukey: prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697) [16:35:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:35:34] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission es1031 [puppet] - 10https://gerrit.wikimedia.org/r/1199463 (https://phabricator.wikimedia.org/T408600) (owner: 10Marostegui) [16:36:00] (03CR) 10Marostegui: "is it already removed from dbctl?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385) (owner: 10Federico Ceratto) [16:36:06] (03CR) 10Ottomata: [C:03+2] edit-analytics - bump to build on bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199464 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [16:36:08] (03PS1) 10Mszwarc: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542) [16:36:27] (03PS1) 10Mszwarc: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542) [16:36:29] (03PS2) 10Elukey: prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697) [16:37:45] (03Merged) 10jenkins-bot: edit-analytics - bump to build on bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199464 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [16:38:29] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:38:34] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [16:40:43] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [16:40:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:41:03] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:41:18] !incidents [16:41:18] 6905 (UNACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [16:41:24] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply 
[16:41:24] <_joe_> sukhe: hi [16:41:29] !ack 6905 [16:41:30] <_joe_> !ack 6905 [16:41:30] !ack 6905 [16:41:32] 6905 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [16:41:33] 6905 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [16:41:33] 6905 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [16:44:09] RESOLVED: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:44:17] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [16:44:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [16:44:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:44:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1031.eqiad.wmnet [16:44:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542) (owner: 10Mszwarc) [16:45:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 
10https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542) (owner: 10Mszwarc) [16:45:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:49:36] (03CR) 10Brouberol: [C:03+2] growthbook: deploy a more modern version against ferretdb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199310 (https://phabricator.wikimedia.org/T408397) (owner: 10Brouberol) [16:50:39] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11320015 (10Marostegui) [16:50:50] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11320042 (10Marostegui) This is ready for #dc-ops [16:51:10] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [16:51:22] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [16:51:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [16:51:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [16:51:57] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [16:52:17] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [16:52:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:52:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:52:39] (03PS1) 10Pppery: Update translation [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1199469 [16:53:11] (03PS2) 10Pppery: Update 
translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1199469 [16:53:42] (03PS3) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1199469 [16:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:00:05] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1700). [17:00:13] o/ [17:00:26] <_joe_> jouncebot: cringe [17:00:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:40] lol [17:01:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199048 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:01:35] (03PS4) 10JHathaway: sysctls: add optional module param to sysctl::parameters [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) [17:02:23] (03Merged) 10jenkins-bot: Enroll 10% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199048 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:02:56] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1199048|Enroll 10% of client sessions in PHP 8.3 (T405955)]] [17:03:06] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:05:09] (03CR) 
10JHathaway: "I wasn't aware of ConditionKernelModuleLoaded. I tried it on a qemu sid box, but I couldn't get it to work properly. I think this is becau" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [17:05:13] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [17:05:23] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1199048|Enroll 10% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:05:47] (03CR) 10Klausman: [C:03+1] prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [17:07:09] !log swfrench@deploy2002 swfrench: Continuing with sync [17:08:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:08:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool sretest2003 T407352', diff saved to https://phabricator.wikimedia.org/P84316 and previous config saved to /var/cache/conftool/dbconfig/20251028-170840-marostegui.json [17:08:46] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [17:09:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:04] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
- https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2040 to clone sretest2003 T407352', diff saved to https://phabricator.wikimedia.org/P84317 and previous config saved to /var/cache/conftool/dbconfig/20251028-170958-marostegui.json [17:11:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2040.codfw.wmnet,sretest2003.codfw.wmnet with reason: Cloning sretest2003 from es2040 [17:11:27] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199048|Enroll 10% of client sessions in PHP 8.3 (T405955)]] (duration: 08m 30s) [17:11:31] (03PS1) 10Marostegui: sretest2003: Move it to es7 [puppet] - 10https://gerrit.wikimedia.org/r/1199472 (https://phabricator.wikimedia.org/T407352) [17:11:32] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:12:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto sretest2003.codfw.wmnet [17:13:01] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences, 06Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11320228 (10CDanis) Sounds good to me @Jdrewniak ! Thanks :) [17:13:11] (03CR) 10Marostegui: [C:03+2] sretest2003: Move it to es7 [puppet] - 10https://gerrit.wikimedia.org/r/1199472 (https://phabricator.wikimedia.org/T407352) (owner: 10Marostegui) [17:13:29] part #1 of the infra window done. part #2 coming soon. 
[17:13:44] (03PS3) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [17:13:54] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320235 (10Dzahn) a:05Neslihan_Turan_WMDE→03None Thank you for taking care of that, Ladsgroup! [17:14:02] (03CR) 10Fabfur: [C:04-1] "still addressing the comments" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [17:14:06] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320237 (10Dzahn) 05Stalled→03In progress [17:14:51] (03CR) 10Andrea Denisse: [C:03+2] alertmanager: Add support for team mentions on the Slack template [puppet] - 10https://gerrit.wikimedia.org/r/1194321 (https://phabricator.wikimedia.org/T408145) (owner: 10Andrea Denisse) [17:18:54] (03CR) 10Scott French: [C:03+2] mw-(api-int|jobrunner): Serve 5% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199047 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:20:45] (03Merged) 10jenkins-bot: mw-(api-int|jobrunner): Serve 5% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199047 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:22:53] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:23:08] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:23:29] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:23:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:25:27] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:25:40] !log 
swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:25:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11320304 (10RobH) [[ https://docs.google.com/spreadsheets/d/13ow4JxrsQdz8KSsdBBNwvlrAuGKo8OHWcnR4RhXTYc0/edit?usp=sharing | Google Sheet listing of all affect... [17:26:03] (03PS3) 10Elukey: prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697) [17:26:23] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:26:31] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:27:25] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:27:37] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:27:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:28:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:28:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:28:46] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:28:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:29:01] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:31:42] (03PS5) 10BCornwall: varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [17:32:30] (03CR) 10BCornwall: "I took the liberty to update two more tests to use 301s instead of 302s. varnishtests now pass. Mind giving that a lookover?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [17:33:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool es2027 T408406', diff saved to https://phabricator.wikimedia.org/P84318 and previous config saved to /var/cache/conftool/dbconfig/20251028-173348-fceratto.json [17:33:53] T408406: decommission es2027 - https://phabricator.wikimedia.org/T408406 [17:38:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11320374 (10RobH) [17:38:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11320386 (10RobH) [17:40:28] (03CR) 10BCornwall: "Marking unresolved" [puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [17:46:12] (03PS2) 10Federico Ceratto: site.pp, es2026.yaml: Decommission es2026 [puppet] - 10https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385) [17:46:12] (03PS1) 10Federico Ceratto: instances.yaml: remove es2027 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199476 (https://phabricator.wikimedia.org/T408406) [17:52:41] 06SRE, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11320455 (10Aklapper) > - The ability to access this page via a custom domain/subdomain (TBD) Wasn't that {T407156} instead of TBD? 
[17:57:24] (03PS1) 10Ottomata: edit-analytics - image bump to fix path route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199479 (https://phabricator.wikimedia.org/T405041) [17:57:36] (03CR) 10Scott French: [C:03+1] {api,rest}-gateway: Update to Envoy 1.32.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199085 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [17:57:38] (03CR) 10Ottomata: [C:03+2] edit-analytics - image bump to fix path route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199479 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [17:59:19] (03Merged) 10jenkins-bot: edit-analytics - image bump to fix path route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199479 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [18:00:05] dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1800). 
[18:01:28] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [18:01:46] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [18:02:06] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [18:02:21] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [18:02:31] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [18:03:02] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [18:04:24] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:06:45] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199481 (https://phabricator.wikimedia.org/T405681) [18:06:52] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199481 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:07:42] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199481 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:08:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:10:54] (03PS1) 10Jdlrobson: Update QuickSurvey platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 [18:13:44] (03PS2) 10Federico Ceratto: sanitize-wiki: log into phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) [18:14:43] 
!log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.25 refs T405681 [18:14:48] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [18:17:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320544 (10Dzahn) [18:21:36] (03PS1) 10Dzahn: admin: add SSH key and restricted group membership for neslihanturan [puppet] - 10https://gerrit.wikimedia.org/r/1199484 (https://phabricator.wikimedia.org/T406590) [18:23:19] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:24:24] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:27:33] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11320568 (10Dzahn) [18:27:56] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11320570 (10Dzahn) added tag for the SRE subteam that owns microsites hosted on "miscweb" / kubernetes [18:28:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:48] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - 
https://phabricator.wikimedia.org/T408592#11320577 (10Dzahn) This is certainly possible (hosting on kubernetes 'miscweb' alongside other microsites) and deployment via deployment servers, but does require... [18:32:33] (03CR) 10Dzahn: aptrepo::staging: add job to clear incoming folder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [18:33:04] (03CR) 10Krinkle: [C:03+1] "Thanks. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [18:33:19] (03PS8) 10Krinkle: varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) [18:35:09] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS trixie [18:37:39] (03PS1) 10Dzahn: add discovery records for gerrit as CNAMEs to public names [dns] - 10https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) [18:39:09] (03CR) 10Dzahn: "Is this what you meant?" 
[dns] - 10https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [18:43:56] (03PS2) 10Dzahn: add discovery records for gerrit as CNAMEs to public names [dns] - 10https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) [18:49:51] (03CR) 10Kamila Součková: [C:03+1] admin: add SSH key and restricted group membership for neslihanturan [puppet] - 10https://gerrit.wikimedia.org/r/1199484 (https://phabricator.wikimedia.org/T406590) (owner: 10Dzahn) [18:50:28] (03CR) 10Pmiazga: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [18:51:36] (03CR) 10Dzahn: [C:03+2] admin: add SSH key and restricted group membership for neslihanturan [puppet] - 10https://gerrit.wikimedia.org/r/1199484 (https://phabricator.wikimedia.org/T406590) (owner: 10Dzahn) [18:51:55] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [18:58:03] (03CR) 10Muehlenhoff: site: initial setup for new logging-sd hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: 10Cwhite) [19:00:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [19:15:13] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320789 (10Dzahn) @Neslihan_Turan_WMDE Your user has just been created on the deployment server now. You have the access. Do you need any other info how to config... 
[19:15:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320792 (10Dzahn) 05In progress→03Resolved a:03Dzahn [19:16:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320807 (10Dzahn) ` deploy1003:~] $ id neslihanturan uid=17901(neslihanturan) gid=500(wikidev) groups=500(wikidev),706(restricted),714(airflow-deployers) ` [19:23:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS trixie [19:24:16] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2006-dev.codfw.wmnet with OS trixie [19:26:41] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy7001.magru.wmnet [19:26:43] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:28:48] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS trixie [19:29:22] (03CR) 10JHathaway: Add the sre.hosts.powercycle cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [19:30:28] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002" [19:30:32] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002" [19:30:33] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:30:33] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy7001.magru.wmnet on all recursors [19:30:36] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy7001.magru.wmnet on all 
recursors [19:31:10] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002" [19:31:16] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002" [19:31:28] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7001.magru.wmnet with OS trixie [19:31:41] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11320900 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host... [19:31:59] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:37:00] (03CR) 10JHathaway: [C:03+2] dmarc: add dmarc monitoring records to more domains [dns] - 10https://gerrit.wikimedia.org/r/1198598 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [19:37:57] !log jhathaway@dns1004 START - running authdns-update [19:38:19] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:39:23] !log jhathaway@dns1004 END - running authdns-update [19:40:28] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [19:44:32] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [19:45:09] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage [19:48:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage [19:51:41] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy7002.magru.wmnet [19:51:43] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:54:56] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11320930 (10VRiley-WMF) Attempted to swap the unit and it wouldn't power back on. Swapped it back out with the old one, and it still won't power on. Check... 
[19:57:08] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002" [19:57:34] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002" [19:57:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:57:35] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy7002.magru.wmnet on all recursors [19:57:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy7002.magru.wmnet on all recursors [19:58:11] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002" [19:58:19] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002" [19:58:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7002.magru.wmnet with OS trixie [19:59:04] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11320950 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host... [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T2000). [20:00:05] Msz2001: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:41] (03CR) 10BCornwall: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1196775 (owner: 10Ncmonitor) [20:00:50] I'm going to deploy [20:00:55] !log brett@dns1004 START - running authdns-update [20:01:44] !log brett@dns1004 END - running authdns-update [20:02:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542) (owner: 10Mszwarc) [20:02:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542) (owner: 10Mszwarc) [20:03:29] (03Merged) 10jenkins-bot: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542) (owner: 10Mszwarc) [20:04:04] (03Merged) 10jenkins-bot: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542) (owner: 10Mszwarc) [20:04:41] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1199466|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]], [[gerrit:1199467|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]] [20:04:53] T408542: hCaptcha: Store risk score in global memcache key - https://phabricator.wikimedia.org/T408542 [20:06:58] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1199466|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]], [[gerrit:1199467|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]] synced to 
the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:36] !log mszwarc@deploy2002 mszwarc: Continuing with sync [20:08:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2006-dev.codfw.wmnet with OS trixie [20:12:08] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199466|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]], [[gerrit:1199467|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]] (duration: 07m 27s) [20:12:16] T408542: hCaptcha: Store risk score in global memcache key - https://phabricator.wikimedia.org/T408542 [20:13:41] (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [20:14:49] (03CR) 10BCornwall: [C:03+2] varnishtest: Remove logfile support [puppet] - 10https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) (owner: 10BCornwall) [20:14:55] (03CR) 10BCornwall: varnishtest: Remove logfile support [puppet] - 10https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) (owner: 10BCornwall) [20:17:26] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321005 (10Peachey88) [20:20:38] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321044 (10jhathaway) @Krd thanks, I'm investigating, not sure of the cause either. 
[20:20:57] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321045 (10jhathaway) p:05Triage→03High [20:23:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:48] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321064 (10Krd) Non-representative example: From MAILER-DAEMON Tue Oct 28 20:21:46 2025 Received: from mx-in1001.wikimedia.org ([2620:0:861:4:208:80:155:102]:55514) by vrts1003.eq... [20:24:24] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:02] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy7001.magru.wmnet with OS trixie [20:25:03] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy7001.magru.wmnet [20:25:12] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321067 (10Krd) It appears to me that we are accepting bounces from phishing e-mails sent with fake sender info@wikipedia.org. [20:25:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321069 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-... 
[20:26:24] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321073 (10Krd) The 219.240.37.89 looks like a common factor. Can we block this source IP for SMTP as a first measure? [20:29:20] !log Deployed change to private Suggested Investigations code [20:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:34] Freeing the window, I deployed all that I planned [20:33:03] (03CR) 10Herron: [C:03+1] "LGTM once the ferm/nftables bit is sorted out!" [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: 10Cwhite) [20:33:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:24] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:41] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy7002.magru.wmnet with reason: host reimage [20:44:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy7002.magru.wmnet with reason: host reimage [20:48:40] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11321177 (10RLazarus) 05Open→03Resolved [20:49:22] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321182 (10jhathaway) >>! In T408632#11321073, @Krd wrote: > The 219.240.37.89 looks like a common factor. 
Can we block this source IP for SMTP as a first measure? done, though a pr... [20:50:25] Hello, all! The Abstract Wikipedia team needs to do a semi-urgent deployment of backend services. I notice that the Web Team deployment window is coming up in ten minutes, but is rarely used. [20:50:36] Will the Web Team be using that window today, or can I grab it? [20:52:10] marostegui@cumin1003 clone (PID 543428) is awaiting input [20:57:20] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639 (10SherryYang-WMF) 03NEW [20:58:19] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T2100) [21:01:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy7002.magru.wmnet with OS trixie [21:01:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy7002.magru.wmnet [21:01:48] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321209 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-... 
[21:16:08] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408585#11321251 (10wiki_willy) a:03VRiley-WMF [21:17:55] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11321257 (10wiki_willy) a:03VRiley-WMF [21:18:16] (03PS4) 10Cwhite: site: initial setup for new logging-sd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) [21:19:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321260 (10Dzahn) [21:21:58] (03PS1) 10Cory Massaro: Wikifunctions: Upgrade orchestrator from 2025-10-22-011302 to 2025-10-28-205854. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199504 (https://phabricator.wikimedia.org/T406540) [21:24:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: 10Cwhite) [21:26:20] (03PS1) 10Cory Massaro: Update function-evaluators from 2025-10-21-143846 to 2025-10-28-150053. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199505 (https://phabricator.wikimedia.org/T407718) [21:26:44] (03PS2) 10Cory Massaro: Wikifunctions: Update function-evaluators from 2025-10-21-143846 to 2025-10-28-150053. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199505 (https://phabricator.wikimedia.org/T407718) [21:27:17] (03CR) 10Bking: [C:03+1] hadoop: cleanup /tmp from directories as well as files [puppet] - 10https://gerrit.wikimedia.org/r/1199334 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [21:27:58] 06SRE, 06Data-Engineering: stat1011: cannot create directory ‘/srv/published/datasets/one-off’: Permission denied - https://phabricator.wikimedia.org/T408641 (10Addshore) 03NEW [21:28:14] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [21:28:42] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [21:37:27] (03CR) 10Cwhite: [C:03+2] site: initial setup for new logging-sd hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: 10Cwhite) [21:44:07] (03PS1) 10JHathaway: postfix: add rspamd network discard map [puppet] - 10https://gerrit.wikimedia.org/r/1199507 (https://phabricator.wikimedia.org/T408632) [21:44:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199507 (https://phabricator.wikimedia.org/T408632) (owner: 10JHathaway) [21:50:04] (03CR) 10JHathaway: [C:03+2] postfix: add rspamd network discard map [puppet] - 10https://gerrit.wikimedia.org/r/1199507 (https://phabricator.wikimedia.org/T408632) (owner: 10JHathaway) [22:04:24] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:10] (03PS1) 10Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1199512 [22:22:08] 06SRE, 10vrts, 10Znuny, 13Patch-For-Review: VRTS is spammed with 
bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321474 (10jhathaway) @Krd how else can I help? [22:22:55] (03PS2) 10Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1199512 [22:23:02] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199512 (owner: 10Andrew Bogott) [22:23:27] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321486 (10Dzahn) @thcipriani Turns out this ticket might change from "restricted" to a full deployment access request. How about your approval if that was the case? [22:24:24] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:25:02] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321488 (10Dzahn) [22:26:38] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321493 (10Dzahn) edited ticket to change request from "restricted" to "deployment" after talking to Sean. We will redo the approvals for that but reuse the ticket. [22:28:08] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321498 (10Dzahn) a:05Dzahn→03thcipriani @seanleong-WMDE Could you add some context re: the request for deployment? 
@thcipriani for your consideration one more time [22:28:48] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321503 (10Dzahn) [22:30:57] (03PS3) 10Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1199512 [22:32:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199512 (owner: 10Andrew Bogott) [22:33:06] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7001.magru.wmnet with OS trixie [22:33:16] (03CR) 10RLazarus: [C:03+2] {api,rest}-gateway: Update to Envoy 1.32.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199085 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [22:33:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321519 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host... 
[22:35:01] (03Merged) 10jenkins-bot: {api,rest}-gateway: Update to Envoy 1.32.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199085 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [22:37:50] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2002.codfw.wmnet [22:37:52] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [22:38:26] (03PS4) 10Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1199512 [22:38:39] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199512 (owner: 10Andrew Bogott) [22:38:48] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [22:39:00] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [22:41:21] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2002.codfw.wmnet - dzahn@cumin2002" [22:41:56] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2002.codfw.wmnet - dzahn@cumin2002" [22:41:56] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:41:57] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2002.codfw.wmnet on all recursors [22:42:00] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2002.codfw.wmnet on all recursors [22:42:13] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [22:42:21] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [22:42:32] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
tcp-proxy2002.codfw.wmnet - dzahn@cumin2002" [22:42:38] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy2002.codfw.wmnet - dzahn@cumin2002" [22:42:58] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy2002.codfw.wmnet with OS trixie [22:43:11] (03PS5) 10Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1199512 [22:43:14] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321544 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host... [22:43:16] dzahn@cumin2002 reimage (PID 1675734) is awaiting input [22:43:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199512 (owner: 10Andrew Bogott) [22:43:33] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy3002.esams.wmnet with OS trixie [22:43:52] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321546 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host... [22:45:30] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321550 (10Dzahn) [22:46:27] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321551 (10Dzahn) All VMs exist now. 
--> https://netbox.wikimedia.org/search/?q=tcp-proxy some still need t... [22:57:32] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321576 (10seanleong-WMDE) Hi, thanks @Dzahn. The ticket has been changed from "restricted" to "deployment", as this is part of the requirements to be a deployer, and "restricted" is... [22:58:51] (03PS1) 10Scott French: mw-(api-ext|web): scale next releases to 20% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955) [22:58:52] (03PS1) 10Scott French: mw-(api-int|jobrunner): serve 10% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199514 (https://phabricator.wikimedia.org/T405955) [22:58:55] (03PS1) 10Scott French: Enroll 25% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199515 (https://phabricator.wikimedia.org/T405955) [22:59:25] (03CR) 10RLazarus: [C:03+1] mw-(api-ext|web): scale next releases to 20% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:59:29] (03CR) 10RLazarus: [C:03+1] Enroll 25% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199515 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:59:32] (03CR) 10RLazarus: [C:03+1] mw-(api-int|jobrunner): serve 10% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199514 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [23:00:52] 06SRE: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts about tcp-proxy3002 - https://phabricator.wikimedia.org/T408646 (10Dzahn) 03NEW [23:01:09] 06SRE: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11321593 (10Dzahn) [23:03:06] 
06SRE: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11321597 (10Dzahn) [23:03:07] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321598 (10seanleong-WMDE) [23:03:13] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy2002.codfw.wmnet with reason: host reimage [23:06:46] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321600 (10seanleong-WMDE) [23:09:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy2002.codfw.wmnet with reason: host reimage [23:12:26] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:14:00] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf group for jpchev - https://phabricator.wikimedia.org/T408636#11321623 (10Dzahn) @Jpchev Hi there, are you a Wikimedia Foundation employee or contractor? Or are you asking for access as a volunteer? Any specific systems you have in mind? 
[23:14:33] (03PS1) 10RLazarus: mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808) [23:16:43] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321629 (10seanleong-WMDE) [23:17:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:21:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11321640 (10Dzahn) [23:25:53] (03CR) 10Scott French: [C:03+1] mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [23:26:33] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy2002.codfw.wmnet with OS trixie [23:26:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy2002.codfw.wmnet [23:26:41] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy7001.magru.wmnet with OS trixie [23:26:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321656 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-... 
[23:26:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321657 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-... [23:28:17] (03CR) 10RLazarus: [C:03+2] mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [23:28:24] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11321663 (10Dzahn) Hello @SherryYang-WMF, re: the "wmf" LDAP group Please take a look here: https://wikitech.wikimedia.... [23:28:35] jouncebot: nowandnext [23:28:35] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [23:28:35] In 0 hour(s) and 31 minute(s): Abstract Wikipedia emergency deploy window (one-off) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T0000) [23:28:58] I'll deploy an envoy upgrade to mw-debug and the canaries [23:30:17] (03Merged) 10jenkins-bot: mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [23:32:44] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [23:32:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:33:12] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [23:33:40] (03CR) 10Atieno: [C:03+1] ExtensionDistributor: Mark 1.45 as beta 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: 10Arlolra) [23:35:23] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [23:35:42] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [23:37:05] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy3002.esams.wmnet with OS trixie [23:37:26] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-... [23:38:01] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1199519 T405808 [23:38:07] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [23:39:24] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:39:38] 06SRE, 06Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11321680 (10Dzahn) The link above is common example. The user asks for `analytics-privatedata-users` (or is told to ask for it as part of some onboarding docs). But that is... 
[23:40:41] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1199519 T405808 (duration: 03m 34s) [23:43:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321712 (10Dzahn) [23:44:02] (03PS1) 10Zabe: Using Hadoop for MostTranscludedPages on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199522 (https://phabricator.wikimedia.org/T309738) [23:44:06] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [23:46:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:48:57] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11321730 (10Papaul) [23:59:47] (03PS1) 10Santiago Faci: Metrics Platform PHP client library: set performer_registration_dt as null when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199524 (https://phabricator.wikimedia.org/T408547)