[00:05:40] RESOLVED: DiskSpace: Disk space ml-serve1012:9100:/ 0.6734% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:08:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1198415
[00:08:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1198415 (owner: TrainBranchBot)
[00:23:22] SRE, SRE-Access-Requests: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11304988 (Ahoelzl) Approved.
[00:24:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:26:05] SRE, SRE-Access-Requests: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11304990 (Ahoelzl) @RLazarus sorry for the nuisance, but we would appreciate if this ticket could be expedited. 🙏
[00:31:25] (PS4) Scott French: P:cache::varish::frontend: render known-client rate limit VCL [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220)
[00:31:25] (CR) Scott French: "This is the start of a patche series that implements custom rate limits for identified known clients (X-Trust-Score "B") in Varnish - i.e." [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: Scott French)
[00:33:32] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1198415 (owner: TrainBranchBot)
[00:54:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:01:19] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:15:05] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 45s)
[01:34:13] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:39:13] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:19:16] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:25:35] (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
[03:13:18] (CR) Anzx: [C:+1] azwiktionary: use new wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
[03:24:16] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:49:16] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[04:00:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:07] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:23:57] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30036 bytes in 0.846 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:24:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:54:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:05:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:58] SRE, Hiddenparma, Traffic, Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11305164 (Joe) >>! In T404826#11304584, @bd808 wrote: > This is a concern that should likely be discussed elsewhere, but to make it known I will state...
[05:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:13] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:42:16] (PS1) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[05:44:04] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7416/co" [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[05:48:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:49:15] (PS2) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[05:50:34] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7417/co" [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[05:53:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:56:00] (PS3) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251024T0600)
[06:02:25] (PS4) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[06:03:53] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7420/co" [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[06:09:07] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:10:57] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 0.982 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:16:00] (PS5) Krinkle: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: Scott French)
[06:19:16] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:19:55] (PS1) MusikAnimal: CodeMirrorWikiEditor: fix selector usurping WikiEditor's search btn [extensions/CodeMirror] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198425 (https://phabricator.wikimedia.org/T404543)
[06:20:37] (CR) MusikAnimal: [C:-2] "no deploys on Fridays" [extensions/CodeMirror] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198425 (https://phabricator.wikimedia.org/T404543) (owner: MusikAnimal)
[06:27:01] (PS5) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[06:28:22] (CR) CI reject: [V:-1] varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[06:32:58] (PS6) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[06:35:23] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11305224 (elukey) The IDRAC 10 support for provision + upgrade-firmware is done, we should be good from the I/F and dcops point of view :)
[06:35:40] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7422/co" [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[06:37:02] (PS7) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[06:38:26] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7423/co" [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[06:41:30] (PS8) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[06:41:53] SRE-SLO, observability, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11305226 (elukey) @RKemper thank a lot for the update! Just to clarify, the SLO dashboarding will be done usi...
[06:43:29] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7424/co" [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[06:45:15] (CR) Slyngshede: [C:+1] admin: add vicaplet-wmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1198320 (https://phabricator.wikimedia.org/T407605) (owner: Kamila Součková)
[06:46:01] (CR) Slyngshede: [C:+1] admin: add skaramwmf to analytics-private-data-users [puppet] - https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) (owner: Kamila Součková)
[06:51:20] (PS2) Krinkle: wmf-config: Stop sending HTTP purges for mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/1198412 (https://phabricator.wikimedia.org/T405931)
[06:51:58] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198412 (https://phabricator.wikimedia.org/T405931) (owner: Krinkle)
[06:52:43] (Merged) jenkins-bot: wmf-config: Stop sending HTTP purges for mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/1198412 (https://phabricator.wikimedia.org/T405931) (owner: Krinkle)
[06:53:15] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1198412|wmf-config: Stop sending HTTP purges for mobile domains (T405931)]]
[06:53:20] T405931: [Clean up] Redirect m-dot URLs to canonical domains - https://phabricator.wikimedia.org/T405931
[06:53:32] (CR) Jelto: [C:+1] "lgtm, thanks for preparing the change" [puppet] - https://gerrit.wikimedia.org/r/1198393 (https://phabricator.wikimedia.org/T408064) (owner: Dzahn)
[06:57:46] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1198412|wmf-config: Stop sending HTTP purges for mobile domains (T405931)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:58:45] (PS1) Elukey: profile::amd_gpu: add link for libdrm_amdgpu with ROCm 7.0.2 [puppet] - https://gerrit.wikimedia.org/r/1198428 (https://phabricator.wikimedia.org/T403697)
[06:59:16] (PS1) Krinkle: varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931)
[06:59:18] (PS1) Krinkle: varnish: Remove temporary enable_m_redir flag [puppet] - https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251024T0700)
[07:00:22] !log krinkle@deploy2002 krinkle: Continuing with sync
[07:03:22] (PS2) Elukey: profile::amd_gpu: add link for libdrm_amdgpu with ROCm 7.0.2 [puppet] - https://gerrit.wikimedia.org/r/1198428 (https://phabricator.wikimedia.org/T403697)
[07:03:33] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198428 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[07:06:51] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198412|wmf-config: Stop sending HTTP purges for mobile domains (T405931)]] (duration: 13m 35s)
[07:06:56] T405931: [Clean up] Redirect m-dot URLs to canonical domains - https://phabricator.wikimedia.org/T405931
[07:16:24] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11305267 (elukey) Open→Resolved a:elukey
[07:17:59] (PS9) Giuseppe Lavagetto: varnish: only use private files when the private repo is available [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826)
[07:17:59] (PS1) Giuseppe Lavagetto: varnishtest: allow including stubs for files from volatile [puppet] - https://gerrit.wikimedia.org/r/1198432
[07:18:46] (CR) CI reject: [V:-1] varnishtest: allow including stubs for files from volatile [puppet] - https://gerrit.wikimedia.org/r/1198432 (owner: Giuseppe Lavagetto)
[07:19:39] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:21:52] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[07:22:40] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[07:23:21] (CR) Fabfur: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[07:24:16] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:19] (PS2) Daniel Kinzler: api-gateway: make cookie name configurable for testing [deployment-charts] - https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128)
[07:26:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[07:27:20] (PS2) Giuseppe Lavagetto: varnishtest: allow including stubs for files from volatile [puppet] - https://gerrit.wikimedia.org/r/1198432
[07:27:54] (PS1) Filippo Giunchedi: installserver: test cloudcontrol2010-dev on UEFI [puppet] - https://gerrit.wikimedia.org/r/1198434 (https://phabricator.wikimedia.org/T407586)
[07:27:57] (CR) CI reject: [V:-1] varnishtest: allow including stubs for files from volatile [puppet] - https://gerrit.wikimedia.org/r/1198432 (owner: Giuseppe Lavagetto)
[07:29:13] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:29:45] (CR) Elukey: [C:+1] installserver: test cloudcontrol2010-dev on UEFI [puppet] - https://gerrit.wikimedia.org/r/1198434 (https://phabricator.wikimedia.org/T407586) (owner: Filippo Giunchedi)
[07:30:00] (CR) Filippo Giunchedi: [C:+2] installserver: test cloudcontrol2010-dev on UEFI [puppet] - https://gerrit.wikimedia.org/r/1198434 (https://phabricator.wikimedia.org/T407586) (owner: Filippo Giunchedi)
[07:30:10] (CR) Fabfur: [C:+1] varnishtest: allow including stubs for files from volatile [puppet] - https://gerrit.wikimedia.org/r/1198432 (owner: Giuseppe Lavagetto)
[07:33:15] (CR) Daniel Kinzler: "I tried to test this on top of I00481c1d8efd02d689 and it crashed:" [deployment-charts] - https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: Clément Goubert)
[07:35:46] (CR) Daniel Kinzler: "I just realized I was testing with 1.26.8-1. Maybe it works with 1.29." [deployment-charts] - https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: Clément Goubert)
[07:37:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[07:40:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:41:53] (CR) Dpogorzelski: [V:+1] profile::amd_gpu: add link for libdrm_amdgpu with ROCm 7.0.2 [puppet] - https://gerrit.wikimedia.org/r/1198428 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[07:42:21] (CR) Dpogorzelski: [V:+1 C:+1] profile::amd_gpu: add link for libdrm_amdgpu with ROCm 7.0.2 [puppet] - https://gerrit.wikimedia.org/r/1198428 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[07:43:41] (CR) Gehel: wdqs.data-transfer: make --force behavior default (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) (owner: Ryan Kemper)
[07:44:25] (CR) Elukey: [C:+2] profile::amd_gpu: add link for libdrm_amdgpu with ROCm 7.0.2 [puppet] - https://gerrit.wikimedia.org/r/1198428 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[07:46:56] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[07:49:16] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:52:14] (CR) Dpogorzelski: [C:+2] team-ml: Change helmfile_admin_ng_pending_changes alert to fire after 1w [alerts] - https://gerrit.wikimedia.org/r/1198321 (https://phabricator.wikimedia.org/T403047) (owner: Klausman)
[07:54:08] (Merged) jenkins-bot: team-ml: Change helmfile_admin_ng_pending_changes alert to fire after 1w [alerts] - https://gerrit.wikimedia.org/r/1198321 (https://phabricator.wikimedia.org/T403047) (owner: Klausman)
[08:00:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:34] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[08:15:23] (PS1) Elukey: Add MI300X node taints to ml-serve1012 [puppet] - https://gerrit.wikimedia.org/r/1198470 (https://phabricator.wikimedia.org/T403697)
[08:15:37] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198470 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[08:24:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:40:15] (PS1) Stevemunene: LVS: etcd data for druid-public-coordinator [puppet] - https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222)
[08:40:17] (PS1) Stevemunene: LVS: Add druid-public-coordinator to service list [puppet] - https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222)
[08:40:59] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07): Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305452 (BTullis) a:BTullis I can pick up this ticket, since I work with Justin in the Dat...
[08:41:34] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07): Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305460 (BTullis)
[08:43:06] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07): Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305464 (BTullis) p:Triage→High
[08:43:43] !log cleanup old jar files on an-worker nodes - T396582 - sudo cumin A:hadoop-worker 'find /tmp -name *.jar -mtime +30 -delete'
[08:43:44] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11305467 (elukey) I've worked with Filippo this morning, we flipped the host to UEFI via provision cookbook and updated the partman partitions. Same error as T407586#...
[08:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:48] T396582: Improve housekeeping of jar files in /tmp on Hadoop workers - https://phabricator.wikimedia.org/T396582
[08:46:31] (PS1) Stevemunene: DNS: Add druid-public-coordinator record [dns] - https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222)
[08:46:48] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07): Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305472 (BTullis)
[08:46:53] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07): Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305473 (BTullis) Following the procedure as defined at: https://wikitech.wikimedia.org/wiki/S...
[08:50:55] (CR) Dpogorzelski: [C:+1] Add MI300X node taints to ml-serve1012 [puppet] - https://gerrit.wikimedia.org/r/1198470 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[08:54:03] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11305493 (elukey) ` grub rescue> set fw_path='(hd3,gpt2)/EFI/debian' prefix='(lvmid/XSJkyY-vEFM-dddW-stYw-b7co-lCY3-xPhPjs/mz1h8d-DR5U-bbX5-Ii1P-CcHG-eaxf-yo6lqG)/boo...
[08:54:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:55:31] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07): Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305495 (BTullis) I verified that the supplied SSH key has not been used in Cloud Services wit...
[08:55:46] (PS1) Jcrespo: transferpy: Fix the check for empty directories [software/transferpy] - https://gerrit.wikimedia.org/r/1198501
[08:58:12] (CR) Btullis: DNS: Add druid-public-coordinator record (1 comment) [dns] - https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) (owner: Stevemunene)
[08:58:35] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07): Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305504 (BTullis)
[09:01:25] (PS1) Brouberol: kubernetes: define a postgresql-documentdb common image [puppet] - https://gerrit.wikimedia.org/r/1198502 (https://phabricator.wikimedia.org/T406578)
[09:02:15] (CR) Btullis: [C:+1] kubernetes: define a postgresql-documentdb common image [puppet] - https://gerrit.wikimedia.org/r/1198502 (https://phabricator.wikimedia.org/T406578) (owner: Brouberol)
[09:02:42] (CR) Brouberol: [C:+2] kubernetes: define a postgresql-documentdb common image [puppet] - https://gerrit.wikimedia.org/r/1198502 (https://phabricator.wikimedia.org/T406578) (owner: Brouberol)
[09:11:05] (CR) Elukey: [C:+2] Add MI300X node taints to ml-serve1012 [puppet] - https://gerrit.wikimedia.org/r/1198470 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[09:16:46] (PS1) Cathal Mooney: Nokia DHCP-Relay: use IRB interface IP [homer/public] - https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577)
[09:18:04] (CR) CI reject: [V:-1] Nokia DHCP-Relay: use IRB interface IP [homer/public] - https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:19:55] (PS2) Cathal Mooney: Nokia DHCP-Relay: use IRB interface IP [homer/public] - https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577)
[09:20:56] (CR) Cathal Mooney: Nokia DHCP-Relay: use IRB interface IP (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:22:00] (PS3) Cathal Mooney: Nokia DHCP-Relay: use IRB interface IP [homer/public] - https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577)
[09:24:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:25:12] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305546 (BTullis) I have created the patch to enable shell access and co...
[09:37:27] (CR) Clément Goubert: "> There is another issue: we also add the typed_per_filter_config key for specifying per-route rate limits (or opt out of rate limits). I " [deployment-charts] - https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: Clément Goubert)
[09:38:06] (PS12) Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490)
[09:39:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:32] (CR) CI reject: [V:-1] api-gateway: Use metadata to flip csp header handling [deployment-charts] - https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: Clément Goubert)
[09:44:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:42] (CR) Daniel Kinzler: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: Pmiazga)
[09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:52:13] (PS4) Daniel Kinzler: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1192879
[09:52:23] (03PS1) 10Brouberol: postgresql-growthbook: deploy a cluster using PG17 + documentdb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198512 (https://phabricator.wikimedia.org/T406579) [09:52:27] (03PS1) 10Brouberol: cloudnative-pg-cluster: allow the deployment of a custom PG image with extensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198513 (https://phabricator.wikimedia.org/T406578) [09:52:29] (03PS1) 10Brouberol: postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) [09:53:12] (03PS3) 10Daniel Kinzler: api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) [09:53:21] (03Abandoned) 10Brouberol: postgresql-growthbook: deploy a cluster using PG17 + documentdb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198512 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:53:28] (03PS2) 10Brouberol: cloudnative-pg-cluster: allow the deployment of a custom PG image with extensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198513 (https://phabricator.wikimedia.org/T406578) [09:53:29] (03PS2) 10Brouberol: postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:27] (03CR) 10Elukey: [C:03+2] "To keep archives happy - in k8s 1.23 IIRC this doesn't work after 
the kubelet has been registered to the k8s api. I did the following:" [puppet] - 10https://gerrit.wikimedia.org/r/1198470 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [09:57:46] (03PS13) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [10:02:51] (03PS1) 10Cathal Mooney: gnmic: split bgp collection targets for Juniper/Nokia [puppet] - 10https://gerrit.wikimedia.org/r/1198515 (https://phabricator.wikimedia.org/T393996) [10:03:52] (03CR) 10Cathal Mooney: [C:03+2] gnmic: split bgp collection targets for Juniper/Nokia [puppet] - 10https://gerrit.wikimedia.org/r/1198515 (https://phabricator.wikimedia.org/T393996) (owner: 10Cathal Mooney) [10:07:07] (03CR) 10Btullis: [C:03+1] "Looks good. I have asked a question separately about whether we should try to embed a `$PG_MAJOR` version number in the image tag, at the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198513 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [10:08:43] (03CR) 10Btullis: [C:03+1] postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [10:09:57] (03CR) 10Cathal Mooney: [C:03+2] Include statements in reverse zones for new subnets [dns] - 10https://gerrit.wikimedia.org/r/1198370 (https://phabricator.wikimedia.org/T396063) (owner: 10Cathal Mooney) [10:10:26] !log cmooney@dns2005 START - running authdns-update [10:11:07] !log cmooney@dns2005 END - running authdns-update [10:19:16] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:20:19] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE 
(2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305659 (10BTullis) @JMoore-WMF There has been a recent change in the proc... [10:20:35] (03PS1) 10Majavah: alertmanager: Disable task creation for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/1198517 [10:28:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS bookworm [10:28:21] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11305674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet... [10:32:31] (03PS3) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [10:38:45] (03CR) 10Clément Goubert: [C:04-1] "Most of the changes are fine, but I'd rather we change the `api-gateway` deployments of this chart as little as possible." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [10:40:13] (03CR) 10Clément Goubert: [C:04-1] "One indentation issue that will probably break, and one comment, otherwise it looks ok to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [10:47:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [10:49:04] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [10:49:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11305694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [10:50:16] (03CR) 10Clément Goubert: [C:04-1] api-gateway: support per-route rate limit groups for rest gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [10:53:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [10:54:13] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:57:10] (03CR) 10Clément Goubert: "Overall LGTM, but I'd rather we use metadata to share configuration with the lua script rather than template hardcoded values." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [10:59:13] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251024T0700) [11:00:05] jelto, arnoldokoth, and mutante: gettimeofday() says it's time for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251024T1100) [11:08:10] (03PS19) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [11:08:10] (03PS20) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [11:08:10] (03PS20) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [11:08:10] (03PS4) 10Btullis: Update the apt components used for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) [11:08:11] (03PS6) 10Btullis: Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) [11:10:55] (03CR) 10CI reject: [V:04-1] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [11:11:40] (03PS6) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [11:12:12] (03PS7) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [11:12:34] (03CR) 10Filippo Giunchedi: [C:03+1] alertmanager: Disable task creation for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/1198517 (owner: 10Majavah) [11:13:04] 
(03CR) 10Majavah: [C:03+2] alertmanager: Disable task creation for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/1198517 (owner: 10Majavah) [11:13:43] (03PS12) 10Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [11:14:04] (03PS8) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [11:14:26] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: allow the deployment of a custom PG image with extensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198513 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:14:40] (03CR) 10Brouberol: [C:03+2] "Yes, I think we should indeed!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198513 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:14:43] (03CR) 10Brouberol: [V:03+2 C:03+2] cloudnative-pg-cluster: allow the deployment of a custom PG image with extensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198513 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:14:52] (03PS9) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [11:16:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [11:16:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2078.codfw.wmnet with OS bookworm [11:16:57] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11305763 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host 
ms-be2078.codfw.wmnet with... [11:17:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [11:24:03] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:26] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:26:01] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:26:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS bookworm [11:26:15] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11305770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet... 
[11:26:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:26:34] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [11:27:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [11:27:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [11:27:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [11:30:26] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:30:27] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:31:23] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305801 (10JMoore-WMF) [11:33:40] (03PS1) 10Jcrespo: transferpy: Force ipv4 usage for now, fix bug with found port [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198521 [11:34:00] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305805 (10JMoore-WMF) [11:34:03] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [11:40:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [11:44:48] (03CR) 10Federico Ceratto: "I added the safety check to avoid running the cookbook against masters and tested it with db1176 and db2230" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [11:45:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [11:49:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [11:50:26] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:50:27] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:52:43] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [11:54:03] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:55:27] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:57:02] (03CR) 10Kamila Součková: [C:03+2] admin: add vicaplet-wmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1198320 (https://phabricator.wikimedia.org/T407605) (owner: 10Kamila Součková) [11:57:09] (03CR) 10Kamila Součková: [C:03+2] admin: add skaramwmf to analytics-private-data-users [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) (owner: 10Kamila Součková) [11:59:03] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:43] (03PS2) 10Kamila Součková: admin: add skaramwmf to analytics-private-data-users [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) [12:04:03] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:24] (03CR) 10Kamila Součková: "@slyngshede@wikimedia.org apologies, I'd forgotten to add the user to the group, could you please review again? Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) (owner: 10Kamila Součková) [12:09:20] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1006.eqiad.wmnet with OS trixie [12:09:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11305892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet with... [12:10:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2078.codfw.wmnet with OS bookworm [12:10:15] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11305893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with... [12:21:03] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11305936 (10Raine) [12:21:51] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11305940 (10Raine) a:05Ahoelzl→03BTullis @BTullis can you please approve the analytics-admins access? Thanks! 
[12:24:09] !log filippo@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2010-dev'] [12:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:29:28] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11305956 (10BTullis) @JMoore-WMF has supplied me with an SSH key via our au... [12:31:01] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851#11305961 (10Jclark-ctr) a:05Jclark-ctr→03VRiley-WMF @VRiley-WMF Assign back to me when you have completed removing missed servers [12:33:46] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197 (10Jclark-ctr) 03NEW [12:34:31] (03PS1) 10DCausse: CompletionSuggester: fix index id format check [extensions/CirrusSearch] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198529 (https://phabricator.wikimedia.org/T404858) [12:34:40] !log filippo@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2010-dev'] [12:34:48] !log filippo@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2010-dev'] [12:34:49] !log sudo manage_principals.py reset-password fabfur --email_address=ffurnari@wikimedia.org: T408193 [12:34:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC morning backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CirrusSearch] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198529 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [12:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:57] T408193: Requesting Kerberos access for fabfur - https://phabricator.wikimedia.org/T408193 [12:36:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198295 (owner: 10DCausse) [12:38:56] 06SRE, 10DNS, 10Domains, 06Traffic: Request to create the donate.wikipedia25.org domain + 301 redirect to a donate.wiki page - https://phabricator.wikimedia.org/T408168#11306020 (10ssingh) a:03BCornwall [12:40:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS bookworm [12:40:10] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11306025 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet... 
[12:41:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet-wmde - https://phabricator.wikimedia.org/T407605#11306026 (10Raine) 05Open→03Resolved a:03Raine Done, @Virginie.caplet let me know in case something doesn't work :-) [12:41:38] !log filippo@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2010-dev'] [12:46:05] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306034 (10Raine) >>! In T408164#11305956, @BTullis wrote: > @JMoore-WMF h... [12:49:18] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11306051 (10fgiunchedi) Today with @elukey's help I upgraded the idrac and bios on cloudcontrol2010-dev, not that I expected it to make a difference (and it didn't) `... 
[12:54:01] (03PS1) 10Kamila Součková: admin: add a-pizzata to analytics-admins, deployment [puppet] - 10https://gerrit.wikimedia.org/r/1198531 (https://phabricator.wikimedia.org/T407228) [12:54:23] !log filippo@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [12:54:38] (03CR) 10Kamila Součková: [C:04-2] "DNM, waiting for approval" [puppet] - 10https://gerrit.wikimedia.org/r/1198531 (https://phabricator.wikimedia.org/T407228) (owner: 10Kamila Součková) [12:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:56:01] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11306066 (10Raine) [12:58:37] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306069 (10BTullis) Great! Thanks @Raine I have also confirmed this with... [12:59:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [12:59:55] !log temp disable "automatically reboot after install" d-i options on apt1002 [12:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:16] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306088 (10JMoore-WMF) requested wmf access, log access, and airflow acces... 
[13:04:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [13:04:56] (03PS5) 10Daniel Kinzler: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 [13:05:01] (03CR) 10Daniel Kinzler: api-gateway: support per-route rate limit groups for rest gateway (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [13:10:15] (03PS4) 10Daniel Kinzler: api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) [13:10:18] (03CR) 10Daniel Kinzler: api-gateway: make cookie name configurable for testing (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [13:12:53] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [13:14:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org [13:23:57] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host clouddumps1001.wikimedia.org [13:25:20] !log mvernon@cumin2002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2078.codfw.wmnet with OS bookworm [13:25:31] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11306166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with... [13:28:53] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org [13:33:59] (03PS1) 10Jgiannelos: Allow proofread page to use parsoid when parsoid render is requested [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) [13:35:02] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11306178 (10fgiunchedi) Ok so I did a test with `grub` at version `2.14~git20250718.0e36779-1` (from experimental) and that changed nothing. I'll be trying with bookwor... 
[13:35:58] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host clouddumps1002.wikimedia.org [13:37:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1001.wikimedia.org [13:37:35] !log filippo@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [13:39:41] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1002.wikimedia.org [13:44:43] RECOVERY - Host ms-be2078 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [13:45:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:40] filippo@cumin2002 reimage (PID 391358) is awaiting input [13:49:32] 06SRE, 10DNS, 10Domains, 06Traffic: Request to create the donate.wikipedia25.org domain + 301 redirect to a donate.wiki page - https://phabricator.wikimedia.org/T408168#11306210 (10SCampos-WMF) [13:50:02] 06SRE, 10DNS, 10Domains, 06Traffic: Request to create the donate.wikipedia25.org domain + 301 redirect to a donate.wiki page - 
https://phabricator.wikimedia.org/T408168#11306211 (10SCampos-WMF) [13:50:33] 06SRE, 10DNS, 10Domains, 06Traffic: Request to create the donate.wikipedia25.org domain + 301 redirect to a donate.wiki page - https://phabricator.wikimedia.org/T408168#11306212 (10SCampos-WMF) Hi team, just updated the task with the final link to the donate.wiki portal. [13:52:43] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:27] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [13:56:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11306225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [13:59:52] !log filippo@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [14:00:38] topranks: FYI I temporarily disabled auto-reboot to debug T407586, meaning sretest1006 will require hitting enter from the console for d-i to reboot into the OS [14:00:40] T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586 [14:00:54] will re-enable soon though [14:01:12] godog: the issue is a problem with Nokia [14:01:21] I need it in a boot loop constantly doing DHCP to debug with them [14:01:31] I just kicked off a reimage, but it helps if it tries fairly constantly [14:01:33] hah nevermind then! [14:01:38] thanks though! 
[14:01:45] sure np [14:02:13] specifically the switch is not sending the reply from install server to the host, though when we tested with their lab gear it worked, I'm trying to determine what is different [14:04:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS trixie [14:05:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:27] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11306307 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet... [14:06:29] Emperor: FYI ^ re: unattended reboots [14:07:13] I've reverted now anyhow, should be fine [14:08:04] godog: thanks for letting me know, I'm watching many of these reinstalls from the console in any case (because I am doing violence to the UEFI partitions on that system) [14:08:19] ok! 
good luck [14:10:10] :) [14:11:27] (03CR) 10CDanis: [C:03+1] profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [14:13:13] RECOVERY - Host ms-be2078 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [14:14:19] not sure why that's there, given it's mid-reimage [14:17:24] (03CR) 10Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [14:17:56] (03PS6) 10Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [14:18:51] (03CR) 10Daniel Kinzler: api-gateway: support per-route rate limit groups for rest gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [14:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:24:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [14:26:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [14:32:58] (03CR) 10Hnowlan: "This makes sense to me. 
It might be worth seeing if our upgrades have made the `ADD_IF_ABSENT` header action work - it didn't seem to be o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [14:33:03] !log filippo@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [14:37:28] (03CR) 10Clément Goubert: [C:03+1] api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 (owner: 10Daniel Kinzler) [14:37:55] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [14:41:37] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306467 (10mpopov) @JMoore-WMF: Why do you need be added to `analytics-pro... [14:42:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11306470 (10Gehel) p:05Triage→03High [14:44:06] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11306474 (10fgiunchedi) I got so far as getting `heaptrack` to give me a call trace for peak allocations ` PEAK MEMORY CONSUMERS 8.59G peak memory consumed over 9 call... [14:45:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11306479 (10BTullis) Thanks @Jclark-ctr - Just to let you know, you can hot-swap this drive at any time. It doesn't need any action from us, since it is a... 
[14:45:20] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306485 (10mpopov) >>! In T408164#11305545, @BTullis wrote: > The `analyti... [14:45:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11306486 (10Jclark-ctr) Thanks will take care of in a few hours [14:46:35] (03CR) 10Clément Goubert: api-gateway: make cookie name configurable for testing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [14:46:42] (03CR) 10Clément Goubert: [C:03+1] api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [14:49:14] (03PS3) 10Novem Linguae: data.yaml: change wiki replica to mediawiki replica [puppet] - 10https://gerrit.wikimedia.org/r/1183247 [14:49:28] (03PS4) 10Novem Linguae: data.yaml: change wiki replica to mediawiki replica [puppet] - 10https://gerrit.wikimedia.org/r/1183247 [14:49:47] (03PS5) 10Novem Linguae: data.yaml: change wiki replica to mediawiki replica [puppet] - 10https://gerrit.wikimedia.org/r/1183247 [14:50:13] (03CR) 10Novem Linguae: "In the latest patchset, I have stripped this patch down to the bare minimum, to increase chances of a review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1183247 (owner: 10Novem Linguae) [14:50:48] (03CR) 10Majavah: [C:03+1] data.yaml: change wiki replica to mediawiki replica [puppet] - 10https://gerrit.wikimedia.org/r/1183247 (owner: 10Novem Linguae) [14:52:56] (03CR) 10Clément Goubert: wikikube: Add wikikube-worker2[248-330] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [14:53:47] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2078.codfw.wmnet with OS trixie [14:53:55] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11306531 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with... [14:55:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS trixie [14:55:30] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11306536 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet... 
[14:59:31] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [15:04:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [15:09:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11306605 (10bking) Sorry for the delayed response, it's been a tumultuous week. There's a lot of interesting debates happening (user vs... [15:13:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [15:16:47] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1006.eqiad.wmnet with OS trixie [15:17:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [15:19:03] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:19:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11306704 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet with... [15:21:54] 06SRE, 10conftool: Add suopport to use different vsthrottle keys - https://phabricator.wikimedia.org/T319533#11306752 (10CDanis) 05Open→03Declined [15:27:00] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11306881 (10cmooney) Hi @VRiley-WMF thanks for that. A few little niggles. Firstly neither of the connections look correct in Netbox, they show a single cable connecting the switches, but as... [15:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:33:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:33:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:34:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:34] (03CR) 10Bking: [C:03+1] blazegraph: add cluster sync check [alerts] - 10https://gerrit.wikimedia.org/r/1174723 (https://phabricator.wikimedia.org/T408026) (owner: 10Gmodena) [15:37:51] (03CR) 10Cathal Mooney: [C:03+1] "Ha good spot thank you!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1198216 (https://phabricator.wikimedia.org/T201491) (owner: 10Raunak1709) [15:37:55] (03CR) 10Cathal Mooney: [C:03+2] Fix typo in description field of patternProperties in device-generic.schema [homer/public] - 10https://gerrit.wikimedia.org/r/1198216 (https://phabricator.wikimedia.org/T201491) (owner: 10Raunak1709) [15:38:10] (03PS3) 10Btullis: Configure production shell access and posix groups for jmoore111 [puppet] - 10https://gerrit.wikimedia.org/r/1198504 (https://phabricator.wikimedia.org/T408164) [15:39:03] (03CR) 10Ahmon Dancy: scap: remove testservers 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [15:39:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11306956 (10elukey) >>! In T406656#11306605, @bking wrote: > My suggestion is to just auto-commit the hieradata from Netbox. I'm not ev... [15:39:15] (03Merged) 10jenkins-bot: Fix typo in description field of patternProperties in device-generic.schema [homer/public] - 10https://gerrit.wikimedia.org/r/1198216 (https://phabricator.wikimedia.org/T201491) (owner: 10Raunak1709) [15:40:33] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306974 (10BTullis) Thanks @mpopov - I'm re-scoping to... 
[15:41:42] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306979 (10BTullis) [15:41:58] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306981 (10BTullis) [15:42:51] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [15:43:58] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11306987 (10BTullis) With the new group membership defi... 
[15:44:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:34] (03PS1) 10Bking: Revert "admin_ng (dse-k8s): watch more OpenSearch-related namespaces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198553 [15:44:41] (03CR) 10Bking: [V:03+2 C:03+2] Revert "admin_ng (dse-k8s): watch more OpenSearch-related namespaces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198553 (owner: 10Bking) [15:45:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:46:07] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:46:19] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2078.codfw.wmnet with OS trixie [15:46:27] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11307003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with... [15:46:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:46:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS bookworm [15:47:01] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11307006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet... [15:48:44] (03PS1) 10BCornwall: varnishtest: Run docker_run with bash, not sh [puppet] - 10https://gerrit.wikimedia.org/r/1198555 [15:49:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11307038 (10BTullis) >>! In T407228#11305940, @Raine wrote: > @BTullis can you please approve the analytics-admins access? Thanks! Appro... [15:49:58] (03CR) 10Fabfur: [C:03+1] "Thanks for taking care of this!" [puppet] - 10https://gerrit.wikimedia.org/r/1198555 (owner: 10BCornwall) [15:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:50:54] (03CR) 10BCornwall: [C:03+2] varnishtest: Run docker_run with bash, not sh [puppet] - 10https://gerrit.wikimedia.org/r/1198555 (owner: 10BCornwall) [15:52:46] (03PS1) 10Andrew Bogott: nova.conf: disable spice and remove [spice] config section [puppet] - 10https://gerrit.wikimedia.org/r/1198556 (https://phabricator.wikimedia.org/T406516) [15:53:32] (03CR) 10Andrew Bogott: [C:03+2] nova.conf: disable spice and remove [spice] config section [puppet] - 10https://gerrit.wikimedia.org/r/1198556 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [15:54:28] (03PS1) 10Cathal Mooney: Add 
BGP from ssw1-d1-eqiad to ssw1-e1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1198557 (https://phabricator.wikimedia.org/T396065) [15:54:42] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11307072 (10BTullis) a:05BTullis→03Raine [15:59:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [16:00:52] (03PS1) 10BCornwall: ncredir: Create donate.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1198558 (https://phabricator.wikimedia.org/T408168) [16:03:02] (03CR) 10Ssingh: [C:03+1] ncredir: Create donate.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1198558 (https://phabricator.wikimedia.org/T408168) (owner: 10BCornwall) [16:05:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [16:05:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [16:05:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [16:09:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [16:10:54] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7451/co" [puppet] - 10https://gerrit.wikimedia.org/r/1198558 (https://phabricator.wikimedia.org/T408168) (owner: 10BCornwall) [16:11:49] FIRING: HelmReleaseBadStatus: Helm release growthbook/ferretdb-growthbook on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:12:10] (03CR) 10BCornwall: [V:03+1 C:03+2] ncredir: Create donate.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1198558 (https://phabricator.wikimedia.org/T408168) (owner: 10BCornwall) [16:13:45] (03CR) 10Cathal Mooney: [C:03+2] Add BGP from ssw1-d1-eqiad to ssw1-e1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1198557 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [16:14:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:59] (03Merged) 10jenkins-bot: Add BGP from ssw1-d1-eqiad to ssw1-e1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1198557 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [16:15:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [16:16:49] RESOLVED: HelmReleaseBadStatus: Helm release growthbook/ferretdb-growthbook on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:16:52] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11307160 (10VRiley-WMF) Thanks, currently looking into this. 
[16:17:24] (03CR) 10Scott French: scap: remove testservers 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [16:20:24] 06SRE, 10DNS, 10Domains, 06Traffic, 13Patch-For-Review: Request to create the donate.wikipedia25.org domain + 301 redirect to a donate.wiki page - https://phabricator.wikimedia.org/T408168#11307174 (10BCornwall) 05Open→03Resolved I've created both donate.wikipedia25.org and donate.wikipedia25.com... [16:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:30:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2078.codfw.wmnet with OS bookworm [16:44:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:10:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:54] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [17:45:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:08] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1198583 [17:46:18] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198583 (owner: 10CDanis) [17:46:35] (03CR) 10CI reject: [V:04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/1198583 (owner: 10CDanis) [17:46:43] mutante: what diff is there? [17:47:17] +et-0-0-29.ssw1-f1-eqiad 1H IN A 10.64.147.11 [17:47:21] topranks: ^ [17:47:21] and similar lines [17:47:24] sorry :P [17:47:42] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:48:13] I was asked to type "go" or "abort". I picked abort. [17:48:19] that caused an exception [17:48:39] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1198583 [17:53:22] mutante: yeah that's expected, er the exception [17:53:30] problem is that now this in theory blocks other changes [17:54:00] but yeah, Cathal changed it so we should be OK merging it [17:54:01] see https://netbox.wikimedia.org/ipam/ip-addresses/21680/changelog/ [17:54:14] mutante: we can wait a bit otherwise happy to +1 to merge [17:56:19] sukhe: lets wait a bit and have lunch. 
thank you! [17:56:48] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work, 13Patch-For-Review: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11307672 (10CDanis) [17:59:05] (03PS1) 10Bking: opensearch-cluster: Enable https-based scrapes from prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198585 (https://phabricator.wikimedia.org/T362114) [17:59:14] (03PS3) 10CDanis: Prometheus metrics for DNS Discovery service state [puppet] - 10https://gerrit.wikimedia.org/r/1198583 (https://phabricator.wikimedia.org/T393966) [18:02:19] (03PS1) 10Andrew Bogott: Revert "nova.conf: disable spice and remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/1198586 [18:02:52] (03CR) 10CI reject: [V:04-1] Revert "nova.conf: disable spice and remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/1198586 (owner: 10Andrew Bogott) [18:05:50] (03PS2) 10Andrew Bogott: Revert "nova.conf: disable spice and remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/1198586 [18:07:08] (03CR) 10Andrew Bogott: [C:03+2] Revert "nova.conf: disable spice and remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/1198586 (owner: 10Andrew Bogott) [18:11:30] (03PS4) 10CDanis: Prometheus metrics for DNS Discovery service state [puppet] - 10https://gerrit.wikimedia.org/r/1198583 (https://phabricator.wikimedia.org/T393966) [18:16:47] !log cdobbins@puppetserver1001 conftool action : get/weight=1; selector: service=cdn [18:17:24] (03CR) 10RLazarus: [C:03+1] Prometheus metrics for DNS Discovery service state [puppet] - 10https://gerrit.wikimedia.org/r/1198583 (https://phabricator.wikimedia.org/T393966) (owner: 10CDanis) [18:18:05] !log cdobbins@puppetserver1001 conftool action : get/pooled=no; selector: service=cdn [18:18:23] (03CR) 10CDanis: [C:03+2] "works https://phabricator.wikimedia.org/P84297" [puppet] - 10https://gerrit.wikimedia.org/r/1198583 
(https://phabricator.wikimedia.org/T393966) (owner: 10CDanis) [18:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:21:46] (03PS1) 10Clare Ming: Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198591 (https://phabricator.wikimedia.org/T401705) [18:23:30] (03CR) 10Santiago Faci: [C:03+1] Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198591 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [18:28:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198591 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [18:31:47] (03PS2) 10Ebernhardson: dumps: Sync cirrus index dumps from hdfs [puppet] - 10https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) [18:32:14] (03CR) 10CI reject: [V:04-1] dumps: Sync cirrus index dumps from hdfs [puppet] - 10https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [18:36:50] (03CR) 10Bking: [C:03+2] opensearch-cluster: Enable https-based scrapes from prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198585 (https://phabricator.wikimedia.org/T362114) (owner: 10Bking) [18:37:09] (03CR) 10Bking: [C:03+2] "self-merging, as this will (hopefully) fix metrics scrapes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198585 (https://phabricator.wikimedia.org/T362114) (owner: 10Bking) [18:37:56] (03PS3) 10Ebernhardson: dumps: Sync cirrus index dumps from hdfs [puppet] - 10https://gerrit.wikimedia.org/r/1184585 
(https://phabricator.wikimedia.org/T366248) [18:38:34] (03Merged) 10jenkins-bot: opensearch-cluster: Enable https-based scrapes from prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198585 (https://phabricator.wikimedia.org/T362114) (owner: 10Bking) [18:43:02] sukhe: let's merge it :) [18:43:07] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:43:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:43:15] mutante: on it! [18:43:18] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:43:21] thank you :) [18:46:36] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge pending changes - sukhe@cumin1003" [18:46:40] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge pending changes - sukhe@cumin1003" [18:46:40] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:46:46] mutante: all yours!
[18:47:12] thanks again, I am going back to your advice now [18:47:38] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:50:22] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:33] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [18:52:35] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:56:22] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [18:56:28] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [18:56:28] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:56:28] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [18:56:32] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [18:56:36] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:57:48] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11307934 (10VRiley-WMF) PEM has been packaged up and sent out. 
[19:00:11] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:00:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:00:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:18] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [19:00:21] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [19:00:29] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [19:05:30] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:08:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:08:24] (03PS3) 10Ryan Kemper: wdqs.data-transfer: --force behavior now mandatory [cookbooks] - 10https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) [19:13:36] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [19:13:38] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:16:40] (03PS1) 10Andrew Bogott: nova-compute on trixie: install qemu-system-modules-spice [puppet] - 10https://gerrit.wikimedia.org/r/1198593 (https://phabricator.wikimedia.org/T406516) [19:18:25] (03CR) 10Andrew Bogott: [C:03+2] nova-compute on trixie: install qemu-system-modules-spice [puppet] - 10https://gerrit.wikimedia.org/r/1198593 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [19:19:22] dzahn@cumin2002 makevm (PID 465344) is awaiting input [19:20:27] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: 
ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:22:03] (03PS1) 10Bking: opensearch-cluster: update network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198594 (https://phabricator.wikimedia.org/T362114) [19:23:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2005-dev.codfw.wmnet with OS trixie [19:23:53] (03CR) 10CI reject: [V:04-1] opensearch-cluster: update network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198594 (https://phabricator.wikimedia.org/T362114) (owner: 10Bking) [19:24:34] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:24:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:24:40] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:24:40] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [19:24:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [19:24:49] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:25:21] (03PS2) 10Bking: opensearch-cluster: update network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198594 (https://phabricator.wikimedia.org/T362114) [19:26:42] (03CR) 10CI reject: [V:04-1] opensearch-cluster: update network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198594 
(https://phabricator.wikimedia.org/T362114) (owner: 10Bking) [19:28:04] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:28:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:28:10] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:10] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [19:28:13] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [19:28:22] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [19:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:32:57] (03PS3) 10Bking: opensearch-cluster: update network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198594 (https://phabricator.wikimedia.org/T362114) [19:34:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:01] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:38:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:39:41] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage [19:43:08] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:43:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:44:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage [19:44:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:44:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:45:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:45:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:51:03] 07Puppet, 10MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425#11308121 (10Krinkle) a:03Krinkle Done as of yesterday, via {T405931}. 
[19:51:13] 07Puppet, 10MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425#11308125 (10Krinkle) 05Open→03Resolved [19:54:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:54:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:54:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:55:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:55:57] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [19:55:59] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:56:05] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:56:11] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:57:31] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [19:57:35] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11308132 (10mpopov) >>! In T408164#11307183, @JMoore-WM... 
[19:59:21] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:59:27] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [19:59:27] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:28] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [19:59:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [19:59:36] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:03:00] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:03:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:03:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:03:06] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [20:03:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [20:03:18] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [20:05:49] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11308166 
(10JMoore-WMF) i can't access datahub or super... [20:06:36] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11308167 (10mpopov) Right, the patch Ben uploaded hasn'... [20:13:03] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:15:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:21:23] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2005-dev.codfw.wmnet with OS trixie [20:22:43] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [20:22:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [20:23:16] (03PS1) 10JHathaway: dmarc: add dmarc monitoring records to more domains [dns] - 10https://gerrit.wikimedia.org/r/1198598 (https://phabricator.wikimedia.org/T404884) [20:24:19] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [20:24:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [20:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:25:27] (03PS2) 10JHathaway: dmarc: add dmarc monitoring records to more domains [dns] - 10https://gerrit.wikimedia.org/r/1198598 (https://phabricator.wikimedia.org/T404884) [20:29:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [20:29:48] !log bking@deploy2002 
helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [20:30:36] 06SRE, 06Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11308242 (10Dzahn) > @Dzahn, would you agree that these are the main points of confusion? Did I miss anything? Yes, I agree and can generally confirm 2 things: The majority... [20:36:36] (03CR) 10Bking: [C:03+2] "self-merging after confirming that the change works from a homedir deploy." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198594 (https://phabricator.wikimedia.org/T362114) (owner: 10Bking) [20:36:38] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [20:36:40] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:41:52] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:41:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:41:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:41:58] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [20:42:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [20:42:06] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:45:33] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:45:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:45:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:45:40] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [20:45:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [20:45:53] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [20:55:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:56:23] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30038 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:57:42] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy4001.ulsfo.wmnet [20:57:44] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:59:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:01:12] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4001.ulsfo.wmnet - 
dzahn@cumin2002" [21:01:12] (03CR) 10Ryan Kemper: [C:03+2] wdqs.data-transfer: --force behavior now mandatory [cookbooks] - 10https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) (owner: 10Ryan Kemper) [21:01:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4001.ulsfo.wmnet - dzahn@cumin2002" [21:01:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:17] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy4001.ulsfo.wmnet on all recursors [21:01:20] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy4001.ulsfo.wmnet on all recursors [21:01:46] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4001.ulsfo.wmnet - dzahn@cumin2002" [21:01:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4001.ulsfo.wmnet - dzahn@cumin2002" [21:05:07] dzahn@cumin2002 makevm (PID 490264) is awaiting input [21:05:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy4001.ulsfo.wmnet with OS trixie [21:05:34] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11308308 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy400... 
[21:08:21] (03Merged) 10jenkins-bot: wdqs.data-transfer: --force behavior now mandatory [cookbooks] - 10https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) (owner: 10Ryan Kemper) [21:08:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:10:32] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2006-dev.codfw.wmnet with OS trixie [21:13:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:18:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:22:23] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [21:22:55] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97) [21:26:55] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [21:26:59] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [21:28:22] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy4001.ulsfo.wmnet with reason: host reimage [21:34:12] !log dzahn@cumin2002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy4001.ulsfo.wmnet with reason: host reimage [21:36:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11308360 (10VRiley-WMF) Created a ticket for ms-be1090 for Supermicro to assist Case #00061744 [21:41:14] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:44:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:51:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:51:29] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy4001.ulsfo.wmnet with OS trixie [21:51:30] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy4001.ulsfo.wmnet [21:51:44] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11308370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy4001.ul... [21:52:51] !log [WDQS] We're experiencing intermittent difficulty keeping up with the volume of updates. We've started seeing a very large spike in new triples around `2025-10-24 20:09:00` [21:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:29] !log [WDQS] See https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs-main&from=2025-10-24T20:06:49.223Z&to=2025-10-24T21:51:54.665Z&timezone=utc&var-graph_type=%289102%7C919%5B35%5D%29&viewPanel=panel-7 for initial 2 hours of instability [21:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:54] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2006-dev.codfw.wmnet with OS trixie [21:55:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on 
ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:02:32] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:05:35] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11308421 (10Novem_Linguae) >>! In T408164#11308166, @JM... 
[22:09:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:10:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:03] !incidents [22:11:03] 6900 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [22:11:08] !ack 6900 [22:11:09] 6900 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [22:12:07] ok got to a computer looking [22:12:25] hmm seems to be tapering off but yeah [22:12:32] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:12:36] looking further [22:14:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:14:54] ha well [22:15:10] I mean you see all the symptoms but I haven't really found the cause [22:15:27] FIRING: [2x] SystemdUnitFailed: 
prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:20:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:23:51] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11308464 (10JMoore-WMF) i don't see myself in this list... [22:24:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:27:47] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11308479 (10Novem_Linguae) Got it. My guess is your wmf... 
[23:20:27] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:38:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198605 [23:38:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198605 (owner: 10TrainBranchBot) [23:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:51:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198605 (owner: 10TrainBranchBot)