[00:00:01] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1197366 (owner: TrainBranchBot)
[00:03:03] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1011.eqiad.wmnet
[00:03:29] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:04:40] RESOLVED: DiskSpace: Disk space ml-serve1012:9100:/ 3.068% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:04:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:05:44] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[00:08:29] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:09:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:14:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:16:25] !log sudo ipmitool -I lanplus -H "cp3073.mgmt.esams.wmnet" -U root -E chassis power cycle
[00:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:20] RECOVERY - Host cp3073 is UP: PING OK - Packet loss = 0%, RTA = 80.00 ms
[00:22:36] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db-test1003.eqiad.wmnet with OS trixie
[00:22:40] PROBLEM - haproxy process on cp3073 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[00:23:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:23:39] cp3073 is depooled so no issues
[00:23:42] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[00:23:42] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[00:25:40] RECOVERY - haproxy process on cp3073 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[00:25:42] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3073 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate
wikiworkshop.org (ECDSA) valid until 2025-11-14 05:58:19 +0000 (expires in 23 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:25:42] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-01-07 23:02:02 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:28:10] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:31:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:33:10] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:34:28] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cp3073.esams.wmnet with reason: depooled
[00:38:10] RESOLVED: [8x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:41:39] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[00:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:00:30] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub
failure - https://phabricator.wikimedia.org/T407586#11296745 (Andrew) With preseed-test I get different but also bad behavior. Grub works, but the kernel won't boot: ` Loading Linux 6.12.43+deb13-amd64 ... Loading initial ramdisk ......
[01:05:45] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11296749 (Andrew) I really need a second config B (or at least 4-drive sw raid) prod server to test this on.
[01:29:15] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:31:32] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:34:26] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 4.462 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[02:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:20:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:14:04] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197722 (owner: Tim Starling)
[03:18:08] (Merged) jenkins-bot: recentchanges: Temporary fix for
incubator exception [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197722 (owner: Tim Starling)
[03:19:09] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1197722|recentchanges: Temporary fix for incubator exception]]
[03:23:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:23:42] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1197722|recentchanges: Temporary fix for incubator exception]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:24:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30040 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:24:36] !log tstarling@deploy2002 tstarling: Continuing with sync
[03:28:47] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197722|recentchanges: Temporary fix for incubator exception]] (duration: 09m 38s)
[03:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:54:15] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:05:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[04:09:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:36] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:36:57] !log repooling cp3073 after reboot and removing downtime (T407110)
[04:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:14] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet
[04:37:30] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp3073.esams.wmnet
[04:37:30] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3073.esams.wmnet
[04:46:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:47:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:52:10] RESOLVED: BFDdown: BFD session down
between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:25:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:29:15] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:33:35] (CR) Giuseppe Lavagetto: [C:+1] deployment_server: Prefix `helmfile apply` output with "[service env]"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192282 (owner: RLazarus)
[05:34:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:57] (CR) Giuseppe Lavagetto: [C:+1] "LGTM, the comment can fully be ignored as it's a volans-like nitpick on coding style."
[puppet] - https://gerrit.wikimedia.org/r/1195352 (owner: RLazarus)
[05:47:02] ops-eqiad, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297060 (Marostegui)
[05:52:00] (PS1) Marostegui: mariadb: Add db1264 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197746 (https://phabricator.wikimedia.org/T407897)
[05:53:12] (PS2) Marostegui: mariadb: Add db1264 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197746 (https://phabricator.wikimedia.org/T407897)
[05:56:04] (CR) Marostegui: [C:+2] mariadb: Add db1264 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197746 (https://phabricator.wikimedia.org/T407897) (owner: Marostegui)
[05:56:50] ops-eqiad, SRE, Data-Persistence, DC-Ops, and 2 others: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297083 (Marostegui) >>! In T407897#11295732, @Jhancock.wm wrote: > @Marostegui could you or someone else on the team fill in the needed info for this task and make a...
[05:57:08] ops-eqiad, SRE, Data-Persistence, DC-Ops, and 2 others: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297084 (Marostegui)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0600)
[06:01:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:03:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 8.941 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:04:50] (PS1) Marostegui: mariadb: Add db2249 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197750 (https://phabricator.wikimedia.org/T407941)
[06:08:03] (CR) Marostegui: [C:+2] mariadb: Add db2249 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197750 (https://phabricator.wikimedia.org/T407941) (owner: Marostegui)
[06:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:20:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:05] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1197067 (owner: L10n-bot)
[06:31:39] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net.
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1197247 (owner: L10n-bot)
[06:34:30] (PS1) Marostegui: mariadb: Add db1265-db1298 to insetup [puppet] - https://gerrit.wikimedia.org/r/1197753 (https://phabricator.wikimedia.org/T405273)
[06:40:05] SRE-SLO, observability, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11297139 (RKemper) >>! In T393966#11201576, @elukey wrote: > @Gehel @RKemper Hi! A while ago I had a chat wit...
[06:42:47] (CR) Marostegui: [C:+2] mariadb: Add db1265-db1298 to insetup [puppet] - https://gerrit.wikimedia.org/r/1197753 (https://phabricator.wikimedia.org/T405273) (owner: Marostegui)
[06:45:58] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:46:10] woot
[06:46:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[06:46:23] !incidents
[06:46:23] 6897 (UNACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:46:24] 6898 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:46:28] !ack 6897
[06:46:28] 6897 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:46:29] !ack 6898
[06:46:30] FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 5 unhealthy realservers pooled on lvs5005:3003 -
https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[06:46:30] 6898 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:46:34] ops-eqiad, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297146 (Marostegui)
[06:50:57] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:51:04] !incidents
[06:51:05] 6897 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:51:05] 6898 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:51:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 7 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[06:52:27] (CR) Slyngshede: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1197672 (owner: Dpogorzelski)
[06:53:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:54:12] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:54:24] (CR) Filippo Giunchedi: "LGTM, I'll let o11y folks vote though" [puppet] - https://gerrit.wikimedia.org/r/1197590 (https://phabricator.wikimedia.org/T407837) (owner: Majavah)
[06:54:27] !incidents
[06:54:27] 6897 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:54:28] 6898 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:54:38] (CR) Filippo Giunchedi: [C:+1] P:wmcs::metricsinfra: Fix thanos::rule usage [puppet] - https://gerrit.wikimedia.org/r/1197591 (https://phabricator.wikimedia.org/T407837) (owner: Majavah)
[06:55:00] (CR) Filippo Giunchedi: [C:+2] cloudceph: set mtu only when interfaces exist [puppet] - https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478) (owner: Filippo Giunchedi)
[06:55:21] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1197696 (https://phabricator.wikimedia.org/T406927) (owner: Kamila Součková)
[07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0700). nyaa~
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:22] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11297158 (elukey) @Dzahn Hello :) There is no need for apologies, I didn't take it in the bad way, what I was trying to convey is tha...
[07:01:36] (PS1) Krinkle: fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403)
[07:03:43] (CR) Elukey: [C:+2] profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos [puppet] - https://gerrit.wikimedia.org/r/1197602 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[07:03:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:04:12] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:58] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:06:42] (PS13) Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387)
[07:09:07] (PS1) Jelto: aptrepo: update gitlab-ce and gitlab-runner to 18.3 [puppet] - https://gerrit.wikimedia.org/r/1197909 (https://phabricator.wikimedia.org/T407943)
[07:10:57] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:11:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[07:16:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[07:17:30] (CR) Jelto: [C:+2] aptrepo: update gitlab-ce and gitlab-runner to 18.3 [puppet] - https://gerrit.wikimedia.org/r/1197909 (https://phabricator.wikimedia.org/T407943) (owner: Jelto)
[07:17:44] (CR) Elukey: [C:+2] Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[07:18:09] (PS1) Marostegui: db1264: Add 1P note [puppet] - https://gerrit.wikimedia.org/r/1197924
[07:18:58] (CR) Marostegui: [C:+2] db1264: Add 1P note [puppet] - https://gerrit.wikimedia.org/r/1197924 (owner: Marostegui)
[07:19:10] elukey: ok to merge?
[07:19:12] FIRING: [3x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:18] marostegui: yep!
[07:19:27] doing it now
[07:21:14] (CR) Volans: deployment_server: Refactor charlie to add a Service dataclass (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1195352 (owner: RLazarus)
[07:21:36] (PS1) Marostegui: db1262.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197926 (https://phabricator.wikimedia.org/T406550)
[07:22:29] (CR) Marostegui: [C:+2] db1262.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197926 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[07:24:13] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:13] jouncebot: now
[07:29:14] For the next 0 hour(s) and 30 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0700)
[07:30:28] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr2-drmrs:9804) - https://phabricator.wikimedia.org/T407945 (LSobanski) NEW
[07:30:49] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407946 (LSobanski) NEW
[07:31:28] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11297227 (fgiunchedi) FWIW yesterday while testing the preseed-test fix for many drives (https://gitlab.wikimedia.org/repos/sre/preseed-test/-/merge_requests/6) I was able to install...
[07:31:45] (PS3) Dpogorzelski: feat: add dpogorzelski user [puppet] - https://gerrit.wikimedia.org/r/1197672
[07:44:08] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr2-drmrs:9804) - https://phabricator.wikimedia.org/T407945#11297243 (cmooney) Open→Resolved a:cmooney There are other peers to that ASN, these not establishing. Removed.
[07:44:55] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407946#11297248 (cmooney) Open→Resolved a:cmooney There are other sessions to that ASN but they have not configured these two....
[07:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:49:25] (PS1) Marostegui: instances.yaml: Add db1262 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197977 (https://phabricator.wikimedia.org/T406550)
[07:49:58] (CR) Marostegui: [C:+2] instances.yaml: Add db1262 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197977 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[07:50:14] (PS1) Dpogorzelski: chore: add dpogorzelski to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1197978
[07:52:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1262 depooled T406550', diff saved to https://phabricator.wikimedia.org/P84211 and previous config saved to /var/cache/conftool/dbconfig/20251022-075234-marostegui.json
[07:52:39] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
[07:54:15] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod -
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:55:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84212 and previous config saved to /var/cache/conftool/dbconfig/20251022-075508-root.json
[07:55:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:57:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: Seanleong-wmde)
[07:57:58] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005
[07:58:05] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest1005
[07:59:02] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005
[07:59:02] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005
[07:59:31] (PS1) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[08:00:05] jelto and hashar: Deploy window Gerrit server reboot (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0800)
[08:00:05] dancy and andre: #bothumor When
your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0800). [08:00:20] jelto: I am around :) [08:00:42] (03PS2) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [08:00:45] I'm also around, we can coordinate here. Or do you prefer the meet session? [08:01:08] meet will work for me as well :) [08:01:16] (03PS1) 10Marostegui: mariadb: Decommission es1029 [puppet] - 10https://gerrit.wikimedia.org/r/1197980 (https://phabricator.wikimedia.org/T407832) [08:02:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1029.eqiad.wmnet [08:02:25] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005 [08:02:34] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005 [08:04:29] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission es1029 [puppet] - 10https://gerrit.wikimedia.org/r/1197980 (https://phabricator.wikimedia.org/T407832) (owner: 10Marostegui) [08:08:02] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gerrit1003.wikimedia.org [08:08:41] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [08:09:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84213 and previous config saved to /var/cache/conftool/dbconfig/20251022-081014-root.json [08:13:49] 
!log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [08:14:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [08:14:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:14:11] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts es1029.eqiad.wmnet [08:14:12] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:14:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:58] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11297337 (10Marostegui) #dc-ops this host is ready. However the host is still UP due to a ipmi connection failure, but the rest of things have been done and you can proceed to... 
[08:15:06] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11297341 (10Marostegui) [08:17:07] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1003.wikimedia.org [08:19:12] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:19:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:20:18] (03PS1) 10Giuseppe Lavagetto: varnish: add XCHS based browser detection routine [puppet] - 10https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826) [08:20:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:10] (03CR) 10Michael Große: [C:03+1] fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle) [08:25:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 7%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84214 and previous config saved to /var/cache/conftool/dbconfig/20251022-082521-root.json [08:28:25] (03PS2) 10Giuseppe Lavagetto: varnish: add XCHS based browser detection routine [puppet] - 
10https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826) [08:31:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1053 to es2 primary as es1030 will be decommissioned T406690 T407953', diff saved to https://phabricator.wikimedia.org/P84215 and previous config saved to /var/cache/conftool/dbconfig/20251022-083134-marostegui.json [08:31:41] T406690: Decommission es1026 - es1034 - https://phabricator.wikimedia.org/T406690 [08:31:41] T407953: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953 [08:31:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1030 T407953', diff saved to https://phabricator.wikimedia.org/P84216 and previous config saved to /var/cache/conftool/dbconfig/20251022-083153-marostegui.json [08:32:55] (03PS1) 10Marostegui: es1030: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197990 (https://phabricator.wikimedia.org/T407953) [08:33:26] (03CR) 10Marostegui: [C:03+2] es1030: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197990 (https://phabricator.wikimedia.org/T407953) (owner: 10Marostegui) [08:36:19] (03PS1) 10Marostegui: installserver: Remove es1052 [puppet] - 10https://gerrit.wikimedia.org/r/1197993 [08:36:36] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:38:30] (03CR) 10Marostegui: [C:03+2] installserver: Remove es1052 [puppet] - 10https://gerrit.wikimedia.org/r/1197993 (owner: 10Marostegui) [08:38:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 6.439 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:40:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84217 and 
previous config saved to /var/cache/conftool/dbconfig/20251022-084027-root.json [08:40:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:41:02] (03CR) 10Dpogorzelski: [C:03+1] feat: add dpogorzelski user [puppet] - 10https://gerrit.wikimedia.org/r/1197672 (owner: 10Dpogorzelski) [08:47:45] (03CR) 10Klausman: [C:03+2] feat: add dpogorzelski user [puppet] - 10https://gerrit.wikimedia.org/r/1197672 (owner: 10Dpogorzelski) [08:48:30] (03CR) 10Cathal Mooney: [C:03+2] Add Nokia devices to common.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1196704 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:49:10] (03CR) 10Vgutierrez: [C:03+1] "PCC looks good: https://puppet-compiler.wmflabs.org/output/1197986/7366/" [puppet] - 10https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [08:54:05] (03CR) 10Klausman: [C:03+1] chore: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1197978 (owner: 10Dpogorzelski) [08:54:09] (03PS3) 10Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) [08:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:55:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 20%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84218 and previous config saved to 
/var/cache/conftool/dbconfig/20251022-085533-root.json [08:56:04] (03PS4) 10Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) [08:59:20] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1197978 (owner: 10Dpogorzelski) [08:59:38] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [09:00:55] (03CR) 10Vgutierrez: [C:03+2] haproxy: Deploy private data files and set lua-prepend-path [puppet] - 10https://gerrit.wikimedia.org/r/1197681 (owner: 10Vgutierrez) [09:01:13] (03CR) 10Klausman: [C:03+2] chore: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1197978 (owner: 10Dpogorzelski) [09:01:30] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30036 bytes in 0.374 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [09:01:42] (03CR) 10Slyngshede: [C:03+1] preseed.yaml: expand regex for sretest100x to include 1005/1006 [puppet] - 10https://gerrit.wikimedia.org/r/1197678 (https://phabricator.wikimedia.org/T405560) (owner: 10Cathal Mooney) [09:02:52] (03PS1) 10Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) [09:04:32] (03CR) 10CI reject: [V:04-1] Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:04:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Reduce weight for db2245 - which was wrong', diff saved to https://phabricator.wikimedia.org/P84219 and previous config saved to 
/var/cache/conftool/dbconfig/20251022-090437-marostegui.json [09:06:45] (03CR) 10Btullis: "You will need to bump the chart version, since this is not an override in the helm values." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene) [09:09:39] (03PS3) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [09:09:51] (03PS2) 10Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) [09:10:06] (03CR) 10Cathal Mooney: [C:03+2] preseed.yaml: expand regex for sretest100x to include 1005/1006 [puppet] - 10https://gerrit.wikimedia.org/r/1197678 (https://phabricator.wikimedia.org/T405560) (owner: 10Cathal Mooney) [09:10:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84220 and previous config saved to /var/cache/conftool/dbconfig/20251022-091039-root.json [09:11:09] (03CR) 10CI reject: [V:04-1] Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:12:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11297474 (10elukey) @Papaul the issue comes before debian and partman, because when I try to provision the host there is no "hard-disk" option to put as... 
[09:12:22] (03CR) 10Btullis: superset: Increase the nginx proxy timeout (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene) [09:13:44] (03CR) 10Tiziano Fogli: [C:03+2] k8s/client_cert: adjust Prometheus certificate renewal timing [puppet] - 10https://gerrit.wikimedia.org/r/1197303 (https://phabricator.wikimedia.org/T407484) (owner: 10Tiziano Fogli) [09:14:38] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:15:37] (03PS1) 10David Caro: dcaro: remove unused old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1198002 [09:16:26] tappof: can I puppet-merge your pending change "Temporarily longer client certs - https://phabricator.wikimedia.org/T343529" [09:16:40] yes federico3, thx [09:18:24] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198002 (owner: 10David Caro) [09:20:08] (03PS4) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [09:20:33] (03PS1) 10Vgutierrez: Revert "chore: add dpogorzelski to ops-limited" [puppet] - 10https://gerrit.wikimedia.org/r/1198003 [09:21:28] (03CR) 10David Caro: [C:03+2] dcaro: remove unused old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1198002 (owner: 10David Caro) [09:21:52] (03CR) 10Vgutierrez: [C:03+2] Revert "chore: add dpogorzelski to ops-limited" [puppet] - 10https://gerrit.wikimedia.org/r/1198003 (owner: 10Vgutierrez) [09:22:24] dcaro: merge mine if it's showing on your puppet-merge session [09:22:42] vgutierrez: it did not, almost finished [09:22:47] thx [09:23:00] you can go now :) [09:23:25] (03PS3) 10Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 
10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) [09:23:55] (03PS1) 10Marostegui: instances.yaml: Remove es1030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1198004 (https://phabricator.wikimedia.org/T407953) [09:25:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2057.codfw.wmnet with reason: Setting up new ES host [09:25:40] (03PS1) 10MVernon: swift: remove ms-be10{89,90} for controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1198005 (https://phabricator.wikimedia.org/T400877) [09:25:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 30%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84221 and previous config saved to /var/cache/conftool/dbconfig/20251022-092545-root.json [09:26:20] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1198004 (https://phabricator.wikimedia.org/T407953) (owner: 10Marostegui) [09:27:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1030 from dbctl T407953', diff saved to https://phabricator.wikimedia.org/P84222 and previous config saved to /var/cache/conftool/dbconfig/20251022-092747-marostegui.json [09:27:52] T407953: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953 [09:28:39] (03PS4) 10Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) [09:30:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:31:56] 06SRE, 10SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955 
(10DPogorzelski-WMF) 03NEW [09:32:21] (03PS1) 10Marostegui: db1251: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198006 (https://phabricator.wikimedia.org/T407463) [09:32:57] (03CR) 10Marostegui: [C:03+2] db1251: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198006 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [09:34:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1251.eqiad.wmnet with reason: Maintenance [09:34:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1251 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84223 and previous config saved to /var/cache/conftool/dbconfig/20251022-093413-marostegui.json [09:35:12] (03CR) 10Cathal Mooney: [C:03+1] Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi) [09:37:06] (03PS2) 10Ayounsi: Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) [09:38:43] (03CR) 10Hnowlan: [C:03+2] "Yeah, these are not publicly routed. 
The rules were added before the inconsistency between GET and POST behaviour in the corresponding res" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112797 (https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan) [09:39:34] (03PS1) 10Cathal Mooney: config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198008 [09:40:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84224 and previous config saved to /var/cache/conftool/dbconfig/20251022-094051-root.json [09:41:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:42:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84225 and previous config saved to /var/cache/conftool/dbconfig/20251022-094213-root.json [09:47:59] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone_es of es2034.codfw.wmnet onto es2057.codfw.wmnet [09:48:00] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005 [09:48:04] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2034 - Depool es2034.codfw.wmnet to then clone it to es2057.codfw.wmnet - fceratto@cumin1003 [09:48:09] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005 [09:48:21] (03PS5) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [09:48:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2034 - Depool 
es2034.codfw.wmnet to then clone it to es2057.codfw.wmnet - fceratto@cumin1003 [09:48:39] (03CR) 10Vgutierrez: [C:03+2] varnish: add XCHS based browser detection routine [puppet] - 10https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [09:50:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on es1030.eqiad.wmnet with reason: Decommissioning [09:50:12] !log Stop mariadb on es1030 for decommissioning T407953 [09:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:19] T407953: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953 [09:51:22] fceratto@cumin1003 clone_es (PID 2441330) is awaiting input [09:51:48] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005 [09:51:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:52:09] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005 [09:52:48] (03CR) 10Jcrespo: [C:03+1] swift: remove ms-be10{89,90} for controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1198005 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [09:53:58] (03CR) 10Btullis: superset: Increase the nginx proxy timeout (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene) [09:55:40] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:55:57] !log marostegui@cumin1003 dbctl commit (dc=all): 
'db1262 (re)pooling @ 60%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84228 and previous config saved to /var/cache/conftool/dbconfig/20251022-095557-root.json [09:56:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:57:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84229 and previous config saved to /var/cache/conftool/dbconfig/20251022-095719-root.json [09:57:35] (03PS1) 10Marostegui: db1263: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1198009 (https://phabricator.wikimedia.org/T406550) [09:58:32] (03PS6) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [09:58:46] cmooney@cumin1003 provision (PID 2450330) is awaiting input [09:58:58] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:59:16] (03CR) 10Marostegui: [C:03+2] db1263: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1198009 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [09:59:51] (03PS1) 10Elukey: profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1000) [10:01:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: 
cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:02:12] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Set Alias entity usage modifier limit to 10. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: 10Seanleong-wmde) [10:05:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:06:17] (03PS1) 10Marostegui: instances.yaml: Add db1263 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1198012 (https://phabricator.wikimedia.org/T406550) [10:07:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:07:18] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1263 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1198012 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [10:08:04] (03CR) 10MVernon: [C:03+2] swift: remove ms-be10{89,90} for controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1198005 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [10:08:48] (03CR) 10Kamila Součková: [C:03+2] Record LDAP access for lsandergreen. 
[puppet] - 10https://gerrit.wikimedia.org/r/1197696 (https://phabricator.wikimedia.org/T406927) (owner: 10Kamila Součková) [10:09:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1263 to dbctl depooled T406550', diff saved to https://phabricator.wikimedia.org/P84230 and previous config saved to /var/cache/conftool/dbconfig/20251022-100920-marostegui.json [10:09:26] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [10:10:07] (03CR) 10Kamila Součková: [C:03+1] admin: add yubikey ed25519-sk ssh key to user dzahn [puppet] - 10https://gerrit.wikimedia.org/r/1197720 (https://phabricator.wikimedia.org/T407917) (owner: 10Dzahn) [10:11:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84231 and previous config saved to /var/cache/conftool/dbconfig/20251022-101103-root.json [10:11:37] (03CR) 10Kamila Součková: [C:03+2] url_downloader: remove hcaptcha proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [10:11:43] (03PS3) 10Effie Mouzeli: etcd::tlsproxy: Remove testserver ACLs 2 [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) [10:12:06] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:12:11] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [10:12:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84232 and previous config saved to 
/var/cache/conftool/dbconfig/20251022-101225-root.json [10:12:28] (03PS3) 10Effie Mouzeli: conftool-data: remove testservers 3 [puppet] - 10https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498) [10:14:44] (03PS7) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [10:14:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:16:14] (03CR) 10Btullis: superset: Increase the nginx proxy timeout (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene) [10:17:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:19:50] (03PS8) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [10:22:06] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:24:44] (03PS28) 10Clément Goubert: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [10:25:45] (03PS4) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) [10:26:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84233 and previous config saved to /var/cache/conftool/dbconfig/20251022-102609-root.json [10:27:07] (03PS2) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) [10:27:29] (03CR) 10Phuedx: Add config for xLab MW Module experiment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [10:27:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84234 and previous config saved to /var/cache/conftool/dbconfig/20251022-102732-root.json [10:27:32] (03CR) 10Phuedx: [C:03+1] Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [10:27:46] !log kamila@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic (T405631) [10:28:14] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:28:25] FIRING: SystemdUnitFailed:
cfssl-ocsprefresh-wikikube_staging.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:28:37] (03PS2) 10Elukey: profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) [10:29:11] !log kamila@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic (T405631) [10:31:06] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:31:13] (03CR) 10Elukey: "Left a comment related to the numerator metrics, lemme know :)" [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [10:31:18] (03PS1) 10Marco Fossati: Deploy the ReaderExperiments extension to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) [10:31:38] (03PS5) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) [10:32:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: 10Marco Fossati) [10:35:02] jouncebot: nowandnext [10:35:02] For the next 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1000) [10:35:02] In 0 hour(s) and 24 minute(s): 
Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1100) [10:35:11] Anyone using this window? [10:35:49] (03PS1) 10Dreamy Jazz: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198017 (https://phabricator.wikimedia.org/T400280) [10:35:57] (03PS1) 10Dreamy Jazz: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198018 (https://phabricator.wikimedia.org/T400280) [10:38:19] (03PS4) 10Effie Mouzeli: conftool-data: remove testservers 3 [puppet] - 10https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498) [10:38:20] (03PS1) 10Effie Mouzeli: scap: remove testservers 4 [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) [10:38:41] Going to proceed with a deploy [10:38:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/IPInfo] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198017 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz) [10:38:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/IPInfo] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198018 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz) [10:39:09] I should be able to abort that if someone needs scap for the window in the next few mins [10:39:53] !log kamila@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T405631) [10:40:25] (03CR) 10Stevemunene: "> You will need to bump the chart version, since this is not an override in the helm values." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene) [10:40:26] !log kamila@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T405631) [10:40:31] (03CR) 10CI reject: [V:04-1] scap: remove testservers 4 [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [10:41:01] (03PS2) 10Effie Mouzeli: scap: remove testservers 4 [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) [10:43:11] (03CR) 10Elukey: [C:03+1] config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198008 (owner: 10Cathal Mooney) [10:44:14] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [10:45:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Decrease db1262 weight', diff saved to https://phabricator.wikimedia.org/P84235 and previous config saved to /var/cache/conftool/dbconfig/20251022-104530-marostegui.json [10:46:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Decrease es2028 weight', diff saved to https://phabricator.wikimedia.org/P84236 and previous config saved to /var/cache/conftool/dbconfig/20251022-104601-marostegui.json [10:47:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84237 and previous config saved to /var/cache/conftool/dbconfig/20251022-104724-root.json [10:48:18] !log kamila@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T405631) [10:48:46] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:49:01] !log kamila@cumin1003 END 
(PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T405631) [10:50:55] (03PS1) 10Marostegui: db2146: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198020 (https://phabricator.wikimedia.org/T407463) [10:50:57] (03Merged) 10jenkins-bot: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198017 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz) [10:51:35] (03CR) 10Marostegui: [C:03+2] db2146: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198020 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [10:51:41] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:52:14] (03CR) 10Cathal Mooney: [C:03+2] config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198008 (owner: 10Cathal Mooney) [10:52:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2146.codfw.wmnet with reason: Maintenance [10:52:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2146 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84238 and previous config saved to /var/cache/conftool/dbconfig/20251022-105255-marostegui.json [10:54:01] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [10:54:22] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [10:57:45] (03Merged) 10jenkins-bot: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1198018 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz) [10:57:47] (03Merged) 10jenkins-bot: config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198008 (owner: 10Cathal Mooney) [10:58:10] (03CR) 10Clément Goubert: "Last patch was a rebase only patch, restoring the +1 from @hnowlan@wikimedia.org" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [10:58:24] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [10:58:32] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1198017|Fix abuse_filter_log index in TempUserIPLookup (T400280)]], [[gerrit:1198018|Fix abuse_filter_log index in TempUserIPLookup (T400280)]] [10:58:37] T400280: Drop `afl_ip` as the last step of the migration to `afl_ip_hex` - https://phabricator.wikimedia.org/T400280 [10:58:52] (03CR) 10Hnowlan: "lgtm. some musings/nits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [10:59:53] (03Merged) 10jenkins-bot: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [10:59:56] (03CR) 10Effie Mouzeli: [C:03+2] etcd::tlsproxy: Remove testserver ACLs 2 [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [11:00:04] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1100). 
[11:00:19] (PS1) Kosta Harlan: EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177)
[11:00:48] (CR) Effie Mouzeli: [C:+2] conftool-data: remove testservers 3 [puppet] - https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498) (owner: Effie Mouzeli)
[11:01:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84239 and previous config saved to /var/cache/conftool/dbconfig/20251022-110111-root.json
[11:02:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84240 and previous config saved to /var/cache/conftool/dbconfig/20251022-110230-root.json
[11:03:01] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1198017|Fix abuse_filter_log index in TempUserIPLookup (T400280)]], [[gerrit:1198018|Fix abuse_filter_log index in TempUserIPLookup (T400280)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:03:13] (03CR) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [11:04:06] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [11:04:23] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [11:04:40] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [11:05:39] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:05:44] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:06:32] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2203.codfw.wmnet [11:07:36] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:08:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:08:34] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198017|Fix abuse_filter_log index in TempUserIPLookup (T400280)]], [[gerrit:1198018|Fix abuse_filter_log index in TempUserIPLookup (T400280)]] (duration: 10m 01s) [11:08:38] T400280: Drop `afl_ip` as the last step of the migration to `afl_ip_hex` - https://phabricator.wikimedia.org/T400280 [11:08:48] I'm done with my deploy [11:09:11] !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker2203.codfw.wmnet [11:09:28] (03PS2) 10Michael Große: beta: Enable ReviseTone Structured Task on enwiki,frwiki,arwiki [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176)
[11:10:29] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1196638 (owner: PipelineBot)
[11:10:53] (PS9) Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943)
[11:12:02] (PS2) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1196672
[11:12:20] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1196638 (owner: PipelineBot)
[11:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:37] (CR) Dreamy Jazz: [C:+1] "Thanks, didn't notice this part was needed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177) (owner: Kosta Harlan)
[11:12:43] (CR) Btullis: [C:+2] Migrate the refine_netflow job to an-launcher1003 [puppet] - https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: Btullis)
[11:12:58] jouncebot: nowandnext
[11:12:58] For the next 0 hour(s) and 47 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1100)
[11:12:58] In 1 hour(s) and 47 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1300)
[11:13:14] Anyone using this window too?
[11:13:21] Got another backport, but should be shorter [11:14:06] I was planning to use it, but it shouldn't interfere with a mediawiki deploy since it's k8 [11:14:17] Okay, thanks [11:14:31] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:14:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [11:14:50] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:14:59] I'm deploying some api/rest gateway patches but shouldn't interfere either [11:15:12] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [11:15:24] Thanks, this one should go faster [11:15:38] (03Merged) 10jenkins-bot: EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [11:15:44] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:15:49] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:15:50] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:16:03] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:16:11] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1198021|EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction (T404177)]] [11:16:16] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [11:16:18] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84241 and previous config saved to /var/cache/conftool/dbconfig/20251022-111617-root.json [11:16:34] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1006 [11:16:39] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1006 [11:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 7%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84242 and previous config saved to /var/cache/conftool/dbconfig/20251022-111736-root.json [11:18:18] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:18:33] (03PS4) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672 (owner: 10PipelineBot) [11:19:57] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:20:16] (03PS1) 10Cathal Mooney: sre.hosts.provision: adjust to always use Homer to config Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198026 [11:20:20] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:20:24] !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Backport for [[gerrit:1198021|EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction (T404177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:20:53] !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Continuing with sync [11:21:50] (03CR) 10Kamila Součková: [C:04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [11:22:08] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Deploy rate limiting in staging (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [11:22:19] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672 (owner: 10PipelineBot) [11:24:10] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672 (owner: 10PipelineBot) [11:24:15] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:24] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:24:53] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:25:00] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198021|EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction (T404177)]] (duration: 08m 48s) [11:25:04] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [11:25:07] I'm done [11:25:08] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1006 [11:25:13] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1006 [11:25:15] (03PS1) 10Marostegui: db1196: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198027 
(https://phabricator.wikimedia.org/T407463) [11:25:18] (03Abandoned) 10Kamila Součková: proxoid: add discovery SAN [puppet] - 10https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) (owner: 10Kamila Součková) [11:26:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Upgrading [11:26:31] (03CR) 10Marostegui: [C:03+2] db1196: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198027 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [11:26:39] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:26:59] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:27:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1196.eqiad.wmnet with reason: Maintenance [11:27:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1196 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84243 and previous config saved to /var/cache/conftool/dbconfig/20251022-112732-marostegui.json [11:27:49] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:28:07] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:28:25] RESOLVED: SystemdUnitFailed: cfssl-ocsprefresh-wikikube_staging.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:30] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:30:02] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:30:31] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:30:56] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: 
apply [11:31:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84244 and previous config saved to /var/cache/conftool/dbconfig/20251022-113123-root.json [11:32:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84245 and previous config saved to /var/cache/conftool/dbconfig/20251022-113243-root.json [11:35:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84246 and previous config saved to /var/cache/conftool/dbconfig/20251022-113521-root.json [11:37:51] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11297888 (10MatthewVernon) [11:40:10] !log mvernon@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ms-be[1089-1090].eqiad.wmnet with reason: awaiting controller swap [11:40:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11297900 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cea00150-47a1-46ce-a142-ec46d9e47678) set by mvernon@cumin1003 for 3 days, 0:... [11:40:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11297901 (10MatthewVernon) @VRiley-WMF the last two nodes ms-be1089 and ms-be1090 are ready for controller swap, please; I've downtimed them for a couple... 
[11:42:13] (03CR) 10Cathal Mooney: [C:03+2] gnmic: add collection for Nokia OSPF states [puppet] - 10https://gerrit.wikimedia.org/r/1196714 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [11:42:17] (03PS2) 10Phuedx: EventStreamConfig: Remove mediawiki.reference_previews stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) [11:43:37] (03PS3) 10Cathal Mooney: gnmic: Adjust BGP collection for Nokia compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1196917 (https://phabricator.wikimedia.org/T405558) [11:45:21] (03PS1) 10Phuedx: EventStreamConfig: Remove wikibase.client.interaction stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198029 (https://phabricator.wikimedia.org/T370045) [11:46:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84247 and previous config saved to /var/cache/conftool/dbconfig/20251022-114629-root.json [11:46:40] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:47:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 20%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84248 and previous config saved to /var/cache/conftool/dbconfig/20251022-114749-root.json [11:48:13] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:48:30] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[11:48:56] (CR) Cathal Mooney: [C:+2] gnmic: Adjust BGP collection for Nokia compatibility [puppet] - https://gerrit.wikimedia.org/r/1196917 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[11:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:50:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84249 and previous config saved to /var/cache/conftool/dbconfig/20251022-115027-root.json
[11:56:32] (CR) Btullis: [C:+1] superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[11:56:46] (CR) Btullis: [C:+1] superset: Increase the nginx proxy timeout (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[11:58:30] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:00:48] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[12:02:38] (03PS1) 10Marostegui: db1184: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198030 (https://phabricator.wikimedia.org/T407463) [12:02:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84251 and previous config saved to /var/cache/conftool/dbconfig/20251022-120256-root.json [12:03:00] (03PS9) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) [12:03:08] !log cmooney@cumin1003 START - Cookbook sre.hosts.remove-downtime for ssw1-d1-eqiad [12:03:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ssw1-d1-eqiad [12:05:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84252 and previous config saved to /var/cache/conftool/dbconfig/20251022-120533-root.json [12:06:25] (03CR) 10Marostegui: [C:03+2] db1184: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198030 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [12:08:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:08:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1184 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84253 and previous config saved to /var/cache/conftool/dbconfig/20251022-120853-marostegui.json [12:10:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:04] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade 
(exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [12:11:33] (03PS1) 10Kamila Součková: admin: add lsandergreen to fr-tech-devs, add ssh [puppet] - 10https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927) [12:12:00] (03PS1) 10Cathal Mooney: Netops BGP alert: make core bgp group names to be case insensitive [alerts] - 10https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558) [12:12:33] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927) (owner: 10Kamila Součková) [12:14:40] (03PS10) 10Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [12:14:50] (03PS1) 10Effie Mouzeli: site.pp: bye bye mwdebugXXXX 5 [puppet] - 10https://gerrit.wikimedia.org/r/1198035 [12:14:56] (03PS3) 10DCausse: cirrus: enable completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858) [12:15:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:48] (03PS2) 10Effie Mouzeli: site.pp: bye bye mwdebugXXXX 5 [puppet] - 10https://gerrit.wikimedia.org/r/1198035 (https://phabricator.wikimedia.org/T397498) [12:17:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84254 and previous config saved to /var/cache/conftool/dbconfig/20251022-121707-root.json [12:18:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 30%: Pooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P84255 and previous config saved to /var/cache/conftool/dbconfig/20251022-121802-root.json [12:18:15] (03CR) 10Marostegui: major-upgrade.py: MariaDB major version upgrade cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [12:19:02] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:19:07] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:19:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84256 and previous config saved to /var/cache/conftool/dbconfig/20251022-122039-root.json [12:21:02] (03CR) 10CI reject: [V:04-1] major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [12:25:10] (03PS1) 10Cory Massaro: Wikifunctions: Upgrade orchestrator from 2025-10-14-194525 to 2025-10-22-011302. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198036 (https://phabricator.wikimedia.org/T381060) [12:27:43] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1198037 (https://phabricator.wikimedia.org/T407975) [12:28:32] (03PS3) 10Michael Große: beta: Enable ReviseTone Structured Task on enwiki,frwiki,arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) [12:30:08] jouncebot: nownandnext [12:31:10] (03PS6) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) [12:31:34] (03CR) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [12:32:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84257 and previous config saved to /var/cache/conftool/dbconfig/20251022-123213-root.json [12:32:21] (03CR) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [12:32:21] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:32:26] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:32:29] (03PS1) 10Cory Massaro: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060) [12:32:53] (03PS11) 10Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [12:33:07] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11298079 (10seanleong-WMDE) NDA signed on my end. Thanks! [12:33:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84258 and previous config saved to /var/cache/conftool/dbconfig/20251022-123308-root.json [12:33:09] jouncebot: now [12:33:10] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [12:33:21] o_O why didn’t it respond to reedy? or am I not seeing it [12:33:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298080 (10MatthewVernon) (to answer the question - like all ms-* nodes, this will continue to be Debian 11 for now, although we might use it for a test... [12:33:38] (03PS2) 10Cory Massaro: Wikifunctions: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060) [12:34:40] (03CR) 10Majavah: [C:03+2] P:toolforge: Move toolviews processing to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1197308 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [12:36:23] Lucas_WMDE: there's a typo in R.eedy's command [12:37:02] cmooney@cumin1003 provision (PID 2610964) is awaiting input [12:37:16] (03PS1) 10Giuseppe Lavagetto: New logo; rate-limit by wmfuniq [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1198041 [12:37:34] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:37:51] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] New logo; rate-limit by wmfuniq [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1198041 (owner: 10Giuseppe Lavagetto) [12:38:21] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:38:37] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:38:38] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Rate-limit by wmfuniq - oblivian@cumin1003" [12:38:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298106 (10elukey) @Papaul @Jhancock.wm I went into System Setup (F2) -> Device -> Raid controller and used the erase function on both 480GB SSDs, clear... 
[12:38:40] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit by wmfuniq - oblivian@cumin1003 [12:39:14] (03CR) 10Matthias Mullie: [C:03+1] Deploy the ReaderExperiments extension to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: 10Marco Fossati) [12:39:26] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit by wmfuniq - oblivian@cumin1003 [12:39:27] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Rate-limit by wmfuniq - oblivian@cumin1003" [12:39:59] (03PS1) 10Reedy: CommonSettings.php: Set $wgOATHRecoveryCodesCount = 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198042 (https://phabricator.wikimedia.org/T407167) [12:40:02] taavi: ah ^^ [12:40:19] and I guess jouncebot doesn’t reply “I don’t understand” like some other bots do (stashbot?) 
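A minimal sketch of how a bot could react to a mistyped command such as `nownandnext`: compare the input against its known commands by string similarity and pick the closest one if it is similar enough. The command list, the cutoff value, and the function name are assumptions for illustration, not jouncebot's actual code; only `now` is confirmed by the log above.

```python
from difflib import get_close_matches

# Hypothetical command list; only "now" is confirmed by the log above.
KNOWN_COMMANDS = ["now", "next", "nowandnext", "refresh"]


def resolve_command(text, cutoff=0.8):
    """Return the known command closest to `text`, or None if nothing is close enough."""
    word = text.strip().lower()
    if word in KNOWN_COMMANDS:
        return word
    # get_close_matches returns candidates whose similarity ratio is >= cutoff,
    # best match first; n=1 keeps only the best one.
    matches = get_close_matches(word, KNOWN_COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for attempt in ("now", "nownandnext", "frobnicate"):
        print(attempt, "->", resolve_command(attempt))
```

Note that `difflib` scores similarity as a ratio between 0 and 1 rather than computing a true edit distance; a Levenshtein implementation would likely serve just as well here.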
[12:40:30] (nope wasn’t stashbot apparently ^^) [12:40:30] it should do string distance and work out if it's close enough to a command it knows [12:40:49] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:40:50] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:41:06] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw and A:cp [12:41:20] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw and A:cp [12:42:31] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Set $wgOATHRecoveryCodesCount = 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198042 (https://phabricator.wikimedia.org/T407167) (owner: 10Reedy) [12:43:22] (03Merged) 10jenkins-bot: CommonSettings.php: Set $wgOATHRecoveryCodesCount = 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198042 (https://phabricator.wikimedia.org/T407167) (owner: 10Reedy) [12:43:53] cmooney@cumin1003 provision (PID 2617100) is awaiting input [12:44:38] (03PS1) 10Majavah: toolforge: toolviews: Drop nginx support [puppet] - 10https://gerrit.wikimedia.org/r/1198045 (https://phabricator.wikimedia.org/T284558) [12:45:42] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:45:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:47:00] (03CR) 10Phuedx: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: 
10Awight) [12:47:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84259 and previous config saved to /var/cache/conftool/dbconfig/20251022-124720-root.json [12:48:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 60%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84260 and previous config saved to /var/cache/conftool/dbconfig/20251022-124814-root.json [12:48:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:48:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:50:56] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2028.codfw.wmnet [12:53:35] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2027.codfw.wmnet [12:53:49] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:54:02] !log reedy@deploy2002 Synchronized wmf-config/CommonSettings.php: T407167 (duration: 08m 29s) [12:54:07] T407167: Only One Recovery codes given - https://phabricator.wikimedia.org/T407167 [12:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:55:39] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Drop proxy IP rate limit exemption [puppet] - 10https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948) [12:55:41] (03PS1) 10Majavah: P:toolforge: Remove separate proxy role [puppet] - 
10https://gerrit.wikimedia.org/r/1198050 (https://phabricator.wikimedia.org/T283948) [12:55:43] (03PS1) 10Majavah: P:toolforge: Remove long-obsolete proxylistener systemd unit code [puppet] - 10https://gerrit.wikimedia.org/r/1198051 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1300). [13:00:05] seanleong-wmde and mfossati: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:16] I can probably deploy in a few minutes but want to finish a code review first [13:00:40] (03CR) 10Slyngshede: [C:03+1] admin: add lsandergreen to fr-tech-devs, add ssh [puppet] - 10https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927) (owner: 10Kamila Součková) [13:00:45] hi there! 
[13:01:06] (03CR) 10Tiziano Fogli: [C:03+1] Netops BGP alert: make core bgp group names to be case insensitive [alerts] - 10https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [13:01:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:01:26] I can self-deploy [13:01:44] (03CR) 10Cathal Mooney: [C:03+2] Netops BGP alert: make core bgp group names to be case insensitive [alerts] - 10https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [13:02:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84262 and previous config saved to /var/cache/conftool/dbconfig/20251022-130226-root.json [13:02:49] mfossati: go ahead :) [13:02:59] all right [13:03:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84263 and previous config saved to /var/cache/conftool/dbconfig/20251022-130320-root.json [13:03:22] (03Merged) 10jenkins-bot: Netops BGP alert: make core bgp group names to be case insensitive [alerts] - 10https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [13:03:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: 10Marco Fossati) [13:03:28] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [13:03:39] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:04:16] (03Merged) 10jenkins-bot: Deploy 
the ReaderExperiments extension to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: 10Marco Fossati) [13:04:46] !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1198016|Deploy the ReaderExperiments extension to English Wikipedia (T406907)]] [13:04:50] T406907: Reader Experiments: Deploy extension to English Wikipedia - https://phabricator.wikimedia.org/T406907 [13:06:45] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:08:52] elukey@cumin1003 reimage (PID 2640471) is awaiting input [13:09:15] !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1198016|Deploy the ReaderExperiments extension to English Wikipedia (T406907)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:09:35] Let me check [13:10:05] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [13:10:58] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:07] it works! [13:11:11] !log mfossati@deploy2002 mfossati: Continuing with sync [13:12:45] (03CR) 10Urbanecm: "question: what happens if we enable Revise Tone _without_ edit check?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: 10Michael Große) [13:15:18] !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198016|Deploy the ReaderExperiments extension to English Wikipedia (T406907)]] (duration: 10m 32s) [13:15:22] T406907: Reader Experiments: Deploy extension to English Wikipedia - https://phabricator.wikimedia.org/T406907 [13:15:31] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11298181 (10elukey) Provisioned the host, retried a reimage, but it didn't boot in d-i. I checked on the DHCP server: ` elukey@install2005:~$ sudo journalctl -u isc-dhcp-server.s... [13:16:09] Lucas_WMDE: all done here :-) [13:16:12] \o/ [13:16:27] I’ll wait for ca. 15 minutes to see if sean shows up, calendar says he might be in a meeting at the moment [13:18:19] Hi, sorry I am late, is the deployment still ongoing? [13:18:20] hi seanleong-wmde! [13:18:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84264 and previous config saved to /var/cache/conftool/dbconfig/20251022-131826-root.json [13:18:31] yes, we can deploy now [13:18:49] Okay! I missed the ytd's one as well, sorry :/ [13:19:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: 10Seanleong-wmde) [13:19:56] (03Merged) 10jenkins-bot: Set Alias entity usage modifier limit to 10. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: 10Seanleong-wmde) [13:20:32] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1180523|Set Alias entity usage modifier limit to 10. 
(T401288)]] [13:20:37] T401288: Implement a more granular alias usage tracking - https://phabricator.wikimedia.org/T401288 [13:21:44] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet, repooling both afterwards [13:21:49] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [13:22:46] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:24:21] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11298199 (10elukey) I think I found a possible lead - `HTTPSBootChecksHostname` may be the problem, since we use a bare IP when doing the HTTP boot. I am not able to set "Disable... [13:24:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11298201 (10Marostegui) I've expanded their logical volume to use most of the disk as we normally do ` root@clouddb1022:~# pvs PV VG Fmt Attr PS... [13:25:16] !log lucaswerkmeister-wmde@deploy2002 seanleong-wmde, lucaswerkmeister-wmde: Backport for [[gerrit:1180523|Set Alias entity usage modifier limit to 10. (T401288)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:25:46] testing now [13:25:46] seanleong-wmde: please test :) [13:25:49] ok [13:25:50] okie [13:26:11] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:26:34] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet, repooling both afterwards [13:27:14] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ssw1-d1-eqiad with reason: downtime ssw1-d1-eqiad until we have the monitoring checks fully working for the new platform [13:27:25] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11298210 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3ced65be-cbbb-4ba9-91b3-b0f2c626ba79) set by cmo... [13:28:42] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1013.eqiad.wmnet, repooling both afterwards [13:28:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298212 (10Papaul) @elukey @MatthewVernon thank you that was very helpful information. Now I can answer your question "In UEFI Boot Mode, fixed media (s... [13:28:47] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [13:29:14] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11298214 (10elukey) @Jhancock.wm Hi! 
The host seems stuck again after trying `reset /system1/pwrmgtsvc1`, it feels like there is something wrong with the host. What do you think? [13:29:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11298217 (10Jclark-ctr) 05Open→03Resolved a:05Marostegui→03Jclark-ctr [13:29:36] (03CR) 10Ssingh: dnsrecursor: use config dir instead of standalone file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [13:29:47] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11298221 (10Jclark-ctr) [13:30:49] (03PS1) 10Marostegui: clouddb102[25]: Add hieradata file [puppet] - 10https://gerrit.wikimedia.org/r/1198064 (https://phabricator.wikimedia.org/T393733) [13:31:44] Looks good so far [13:31:53] ok [13:32:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298231 (10elukey) @Papaul this is true, the debian installer is the one that eventually sets the proper boot disk, but in all other models we have a ge... [13:32:02] did you find a page with >10 alias usages? 
[13:32:09] (nothing in mwdebug so far) [13:32:12] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2030.codfw.wmnet [13:32:21] (03CR) 10FNegri: [C:03+1] clouddb102[25]: Add hieradata file [puppet] - 10https://gerrit.wikimedia.org/r/1198064 (https://phabricator.wikimedia.org/T393733) (owner: 10Marostegui) [13:32:24] I created a module for it and it's working, but not sure about current pages [13:32:28] ah, ok [13:32:33] (03CR) 10Marostegui: [C:03+2] clouddb102[25]: Add hieradata file [puppet] - 10https://gerrit.wikimedia.org/r/1198064 (https://phabricator.wikimedia.org/T393733) (owner: 10Marostegui) [13:33:31] (03PS1) 10CDanis: haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 [13:33:42] I’m trying out an SQL query [13:33:43] SELECT eu_page_id, eu_entity_id, COUNT(*) FROM wbc_entity_usage WHERE eu_aspect LIKE 'A.%' GROUP BY eu_page_id, eu_entity_id HAVING COUNT(*) > 10 LIMIT 10; [13:33:47] not sure if it’ll work [13:33:58] (03CR) 10CI reject: [V:04-1] haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (owner: 10CDanis) [13:34:04] that's a great idea [13:34:18] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1013.eqiad.wmnet, repooling both afterwards [13:34:23] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [13:34:47] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2029.codfw.wmnet [13:34:59] im running it on hewiki [13:35:10] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1014.eqiad.wmnet, repooling both afterwards [13:35:17] (03PS1) 10Marostegui: mariadb: Decommission es1030 [puppet] - 10https://gerrit.wikimedia.org/r/1198066 (https://phabricator.wikimedia.org/T407953) [13:35:19] 
since that has the highest chance of having it as we ran the script there to update to alias already [13:35:44] (03PS2) 10CDanis: haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) [13:36:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1030.eqiad.wmnet [13:36:05] Empty set (1 min 59.356 sec) [13:36:08] well, so much for that idea [13:36:10] (03PS3) 10CDanis: haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) [13:36:11] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [13:36:16] ok, maybe it’ll work on hewiki [13:36:30] seanleong-wmde: where are you running the query? quarry? stat servers? something else? [13:36:34] hahaha no results as well [13:36:35] quarry [13:36:37] ok [13:36:43] just wanted to make sure you’re not using the production servers :D [13:36:46] (03PS2) 10Marostegui: mariadb: Decommission es1030 [puppet] - 10https://gerrit.wikimedia.org/r/1198066 (https://phabricator.wikimedia.org/T407953) [13:36:52] then I guess we can’t really verify that it makes a difference or not [13:36:54] on existing pages at least [13:37:00] let’s trust your test module [13:37:08] yeaa, I will continue to try and find after sync? 
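The query tried above keeps only (page, entity) pairs that have more than 10 rows whose aspect starts with `A.`. A toy sketch of the same GROUP BY / HAVING logic on an in-memory SQLite table follows. The reduced three-column schema, the helper name, and the sample rows (including the `A.x0`-style aspect values) are invented for illustration; the real table lives on MariaDB and has more columns.

```python
import sqlite3

# Same query as in the log, minus the trailing semicolon.
QUERY = """
SELECT eu_page_id, eu_entity_id, COUNT(*)
FROM wbc_entity_usage
WHERE eu_aspect LIKE 'A.%'
GROUP BY eu_page_id, eu_entity_id
HAVING COUNT(*) > 10
LIMIT 10
"""


def pages_over_alias_limit(rows):
    """Load (page_id, entity_id, aspect) rows into a toy table and run the query."""
    conn = sqlite3.connect(":memory:")
    # Assumed, simplified schema: only the columns the query touches.
    conn.execute(
        "CREATE TABLE wbc_entity_usage ("
        "eu_page_id INTEGER, eu_entity_id TEXT, eu_aspect TEXT)"
    )
    conn.executemany("INSERT INTO wbc_entity_usage VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Made-up sample data: page 1 has 11 alias-aspect rows for Q1 plus one
# non-alias row; page 2 has only 3 alias-aspect rows for Q2.
SAMPLE_ROWS = (
    [(1, "Q1", f"A.x{i}") for i in range(11)]
    + [(2, "Q2", f"A.x{i}") for i in range(3)]
    + [(1, "Q1", "L.en")]
)

if __name__ == "__main__":
    print(pages_over_alias_limit(SAMPLE_ROWS))
```

With this sample data only the page 1 / Q1 pair crosses the threshold; an empty result, as seen in the log, simply means no pair had more than 10 matching rows.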
[13:37:09] !log lucaswerkmeister-wmde@deploy2002 seanleong-wmde, lucaswerkmeister-wmde: Continuing with sync [13:37:14] eh, no need imho [13:37:22] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission es1030 [puppet] - 10https://gerrit.wikimedia.org/r/1198066 (https://phabricator.wikimedia.org/T407953) (owner: 10Marostegui) [13:37:31] okay [13:39:58] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1014.eqiad.wmnet, repooling both afterwards [13:40:00] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1015.eqiad.wmnet, repooling both afterwards [13:40:03] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [13:41:19] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1180523|Set Alias entity usage modifier limit to 10. 
(T401288)]] (duration: 20m 47s) [13:41:24] T401288: Implement a more granular alias usage tracking - https://phabricator.wikimedia.org/T401288 [13:41:49] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [13:42:08] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2249:9290 - https://phabricator.wikimedia.org/T407879#11298266 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:43:23] !log UTC afternoon backport+config window done [13:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:47] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1015.eqiad.wmnet, repooling both afterwards [13:44:49] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling both afterwards [13:45:06] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1030.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [13:45:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1030.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [13:45:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1030.eqiad.wmnet [13:45:23] thanks! 
Lucas_WMDE [13:45:31] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298285 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1003 for hosts: `es1030.eqiad.wmnet` - es1030.eqiad.wmnet (**... [13:45:40] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298286 (10Marostegui) a:05Marostegui→03None This is ready for #dc-ops [13:46:04] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298291 (10Marostegui) [13:46:54] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11298295 (10Jhancock.wm) 05Open→03Resolved [13:47:02] (03CR) 10Joely Rooke WMDE: "Hi there! I am wondering if it's possible to keep this stream open for future tracking usages?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198029 (https://phabricator.wikimedia.org/T370045) (owner: 10Phuedx) [13:47:36] (03CR) 10David Caro: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1198045 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [13:49:03] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [13:49:39] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:02] (03CR) 10Vgutierrez: [C:03+1] haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [13:50:46] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling both afterwards [13:50:48] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling both afterwards [13:50:51] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [13:51:30] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298310 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [13:55:34] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling both afterwards [13:55:36] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1018.eqiad.wmnet, repooling both afterwards [13:56:25] FIRING: SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:57:21] Lucas_WMDE https://en.wikipedia.org/w/index.php?title=Template:Sandbox/Seanleong8/Blank&action=info the changes are showing in live wiki now, thanks!
[13:58:51] \o/
[13:59:46] o7
[14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1400)
[14:00:23] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1018.eqiad.wmnet, repooling both afterwards
[14:00:25] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1019.eqiad.wmnet, repooling both afterwards
[14:00:29] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[14:01:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:02:45] (PS1) Ladsgroup: mariadb::research: Add mysql user [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:02:53] (CR) Michael Große: "The Check would not show."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: Michael Große)
[14:03:07] (PS2) Ladsgroup: mariadb::research: Add mysql user [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:03:50] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11298377 (Andrew) In any case, it's clear that preseed-test isn't going to help with the actual issue on 2010-dev :/
[14:03:52] (CR) CI reject: [V:-1] mariadb::research: Add mysql user [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: Ladsgroup)
[14:05:11] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1019.eqiad.wmnet, repooling both afterwards
[14:05:13] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet, repooling both afterwards
[14:05:19] (CR) Cory Massaro: [C:+2] Wikifunctions: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846. [deployment-charts] - https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060) (owner: Cory Massaro)
[14:05:24] (CR) Stevemunene: [C:+2] superset: Increase the nginx proxy timeout (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[14:06:58] (Merged) jenkins-bot: Wikifunctions: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846.
[deployment-charts] - https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060) (owner: Cory Massaro)
[14:07:16] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:07:39] (Merged) jenkins-bot: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[14:07:58] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:08:44] (PS3) Ladsgroup: mariadb::research: Add mysql user [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:09:40] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:09:47] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet, repooling both afterwards
[14:09:48] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1021.eqiad.wmnet, repooling both afterwards
[14:09:52] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[14:10:26] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:10:36] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:11:07] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: Ladsgroup)
[14:11:22] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:11:30] (CR) Majavah: [C:+2] toolforge: toolviews: Drop nginx support [puppet] - https://gerrit.wikimedia.org/r/1198045 (https://phabricator.wikimedia.org/T284558) (owner:
Majavah)
[14:11:36] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2032.codfw.wmnet
[14:12:00] (CR) Cory Massaro: [C:+2] Wikifunctions: Upgrade orchestrator from 2025-10-14-194525 to 2025-10-22-011302. [deployment-charts] - https://gerrit.wikimedia.org/r/1198036 (https://phabricator.wikimedia.org/T381060) (owner: Cory Massaro)
[14:13:36] (Merged) jenkins-bot: Wikifunctions: Upgrade orchestrator from 2025-10-14-194525 to 2025-10-22-011302. [deployment-charts] - https://gerrit.wikimedia.org/r/1198036 (https://phabricator.wikimedia.org/T381060) (owner: Cory Massaro)
[14:14:33] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1021.eqiad.wmnet, repooling both afterwards
[14:14:35] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling both afterwards
[14:14:48] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2031.codfw.wmnet
[14:15:33] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:16:00] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:16:15] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:16:43] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:16:50] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:17:15] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:18:11] sre-alert-triage, SRE Observability (FY2025/2026-Q2): Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) - https://phabricator.wikimedia.org/T407484#11298507 (hnowlan)
[14:19:15] FIRING: CertAlmostExpired: Certificate for
service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:19:21] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling both afterwards
[14:19:25] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[14:20:12] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[14:20:44] (CR) FNegri: [C:+1] P:toolforge::k8s::haproxy: Drop proxy IP rate limit exemption [puppet] - https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[14:21:39] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[14:21:39] (PS1) Filippo Giunchedi: installserver: revert 'cloudcontrol2010-dev' to standard recipes [puppet] - https://gerrit.wikimedia.org/r/1198082 (https://phabricator.wikimedia.org/T407586)
[14:22:18] (CR) Andrew Bogott: [C:+1] installserver: revert 'cloudcontrol2010-dev' to standard recipes [puppet] - https://gerrit.wikimedia.org/r/1198082 (https://phabricator.wikimedia.org/T407586) (owner: Filippo Giunchedi)
[14:22:32] (CR) Filippo Giunchedi: [V:+2 C:+2] installserver: revert 'cloudcontrol2010-dev' to standard recipes [puppet] - https://gerrit.wikimedia.org/r/1198082 (https://phabricator.wikimedia.org/T407586) (owner: Filippo Giunchedi)
[14:24:13] (CR) Majavah: [C:+2] P:toolforge::k8s::haproxy: Drop proxy IP rate limit exemption [puppet] - https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[14:25:57] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host -
https://phabricator.wikimedia.org/T394357#11298544 (Jhancock.wm) @elukey found the server up. maybe it takes 5 million years to boot? i remember some of the ms-be supermicro servers had the same issue before with a slig...
[14:26:39] (CR) Elukey: [C:+1] "Left a comment, if it is not a concern go ahead!" [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[14:27:15] (CR) Elukey: [C:+1] "It is fine to have this snippet in two places, but long term we may want to have it somewhere reusable/more-DRY :)" [cookbooks] - https://gerrit.wikimedia.org/r/1198026 (owner: Cathal Mooney)
[14:28:38] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1400)
[14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1430)
[14:30:21] (PS12) CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333)
[14:30:49] (CR) CI reject: [V:-1] dnsrecursor: use config dir instead of standalone file [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: CDobbins)
[14:30:54] (CR) Hnowlan: [C:+1] rest-gateway: Deploy rate limiting in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: Clément Goubert)
[14:31:28] (CR) Elukey: "Left some comments!"
[software/netbox-extras] - https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: Ayounsi)
[14:34:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:29] (PS13) CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333)
[14:36:38] (CR) CDobbins: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: CDobbins)
[14:39:24] (CR) CDobbins: [V:+1] dnsrecursor: use config dir instead of standalone file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: CDobbins)
[14:39:39] (CR) CDanis: [C:+2] haproxy: ja4h: all magru [puppet] - https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) (owner: CDanis)
[14:42:49] (CR) Tiziano Fogli: "We need to remove this declaration to avoid a duplicate resource declaration when including the pilot instance (tested on Pontoon)." [puppet] - https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: Herron)
[14:43:53] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11298596 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:11] ops-eqiad, SRE, DC-Ops, Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11298598 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:18] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11298600 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:26] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11298614 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:34] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11298616 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:38] ops-eqiad, SRE, DC-Ops, observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11298619 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:44] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11298620 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:52] ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11298621 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:48:28] (PS4) Ladsgroup: mariadb::research: Add mysql user [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:48:33] (CR) Ladsgroup: [C:+2] mariadb::research: Add mysql user [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: Ladsgroup)
[14:48:35] (CR) Ladsgroup: [V:+2 C:+2] mariadb::research: Add mysql user [puppet] - https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: Ladsgroup)
[14:50:57] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2034.codfw.wmnet
[14:55:13] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2033.codfw.wmnet
[14:57:20] (PS3) Dr0ptp4kt: profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: Elukey)
[14:58:11] SRE, SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11298681 (calbon) I approve this request
[14:58:41] !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[14:59:10] (CR) Dr0ptp4kt: [C:+1] "I added `prometheus=\"k8s\"` to the definitions. Otherwise LGTM. +1'ing, for your next move @ltoscano@wikimedia.org . Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: Elukey)
[15:01:54] SRE, Hiddenparma, Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298704 (dancy)
[15:02:16] ops-codfw, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991 (Jhancock.wm) NEW
[15:02:58] SRE, Hiddenparma, Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298723 (dancy) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on `deployment-cache-upload08.deployment-prep`. Please help!
[15:06:07] (PS1) Ladsgroup: mariadb::research: Add ferm hole for mariadb service [puppet] - https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089)
[15:07:19] (PS2) Ladsgroup: mariadb::research: Add ferm hole for mariadb service [puppet] - https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089)
[15:08:25] SRE, Hiddenparma, Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298750 (ssingh) >>! In T404826#11298704, @dancy wrote: > https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on `deployment-cache-upload...
[15:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:12] (PS3) Ladsgroup: mariadb::research: Add ferm hole for mariadb service [puppet] - https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089)
[15:11:41] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089) (owner: Ladsgroup)
[15:13:32] (CR) Cathal Mooney: "Thanks! Yeah I _thought_ the provision cookbook calls the sre.network.configure-switch-interfaces cookbook, but it seems it runs the func" [cookbooks] - https://gerrit.wikimedia.org/r/1198026 (owner: Cathal Mooney)
[15:13:36] (CR) Cathal Mooney: [C:+2] sre.hosts.provision: adjust to always use Homer to config Nokia [cookbooks] - https://gerrit.wikimedia.org/r/1198026 (owner: Cathal Mooney)
[15:16:29] (CR) Marostegui: [C:+1] mariadb::research: Add ferm hole for mariadb service [puppet] - https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089) (owner: Ladsgroup)
[15:18:21] (PS1) Cathal Mooney: team-netops: add checks against Nokia OSPF status [alerts] - https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558)
[15:18:57] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298804 (Papaul) @elukey on can you please provide me with one of the node that is working like you said so i can check what is different from this no...
[15:20:09] (Merged) jenkins-bot: sre.hosts.provision: adjust to always use Homer to config Nokia [cookbooks] - https://gerrit.wikimedia.org/r/1198026 (owner: Cathal Mooney)
[15:21:32] (PS2) Cathal Mooney: team-netops: add checks against Nokia OSPF status [alerts] - https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558)
[15:23:17] ops-codfw, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298825 (Marostegui)
[15:23:56] ops-codfw, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298826 (Marostegui) The patch was done before this task got created, but linking it here for clarity https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197750
[15:24:12] ops-codfw, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298828 (Marostegui)
[15:24:15] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:49] ops-codfw, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298834 (Jhancock.wm) yes forgot to mention that while making this one. thank you so much for getting it done early!
[15:25:06] ops-codfw, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298835 (Jhancock.wm) a:Jhancock.wm→None
[15:30:13] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2036.codfw.wmnet
[15:31:34] (PS1) Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586)
[15:32:59] (CR) Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[15:34:10] (CR) Ladsgroup: [C:+2] mariadb::research: Add ferm hole for mariadb service [puppet] - https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089) (owner: Ladsgroup)
[15:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:21] (CR) Cathal Mooney: [C:+2] Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[15:34:25] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2035.codfw.wmnet
[15:36:04] (Merged) jenkins-bot: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[15:39:27] SRE, Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab -
https://phabricator.wikimedia.org/T407994 (amastilovic) NEW
[15:42:50] (PS1) Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - https://gerrit.wikimedia.org/r/1198104
[15:43:38] (PS9) Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943)
[15:44:33] (CR) Dreamy Jazz: hCaptcha: Enable hCaptcha for form edits on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[15:45:12] (CR) Dreamy Jazz: hCaptcha: Enable hCaptcha for form edits on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[15:45:39] (CR) Dreamy Jazz: hCaptcha: Enable hCaptcha for form edits on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[15:45:56] (PS2) Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727)
[15:48:24] (PS2) Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586)
[15:48:57] (CR) Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[15:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 -
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:50:00] (CR) Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[15:50:05] (CR) Kosta Harlan: [C:-1] hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[15:50:27] (CR) Kamila Součková: [C:+2] admin: add lsandergreen to fr-tech-devs, add ssh [puppet] - https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927) (owner: Kamila Součková)
[15:52:59] (CR) MSantos: [C:+1] fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle)
[15:58:06] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[15:59:47] (PS2) Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - https://gerrit.wikimedia.org/r/1198104
[16:00:50] SRE, SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299061 (Raine) a:Raine
[16:03:39] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11299082 (elukey) @Papaul this is the first dell config j that we flip to UEFI :)
[16:04:03] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul2002.codfw.wmnet with reason: still in setup
[16:05:15] (PS2) Majavah: P:toolforge: Remove separate proxy role [puppet] - https://gerrit.wikimedia.org/r/1198050 (https://phabricator.wikimedia.org/T283948)
[16:05:15] (PS2) Majavah: P:toolforge: Remove long-obsolete proxylistener systemd unit code [puppet] - https://gerrit.wikimedia.org/r/1198051
[16:05:15] (PS1) Majavah: P:toolforge::prometheus: Drop separate front proxy scrape target [puppet] - https://gerrit.wikimedia.org/r/1198105 (https://phabricator.wikimedia.org/T283948)
[16:05:28] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:05:34] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:06:00] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bswiki --logwiki=metawiki Horvathbence200603 HorvBence # T407995
[16:06:04] T407995: Unblock stuck global rename of HorvBence - https://phabricator.wikimedia.org/T407995
[16:06:12] (CR) Majavah: [V:+1 C:+2] hieradata: Make cloudweb Icinga checks non-critical [puppet] - https://gerrit.wikimedia.org/r/1196019 (https://phabricator.wikimedia.org/T407208) (owner: Majavah)
[16:08:30] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:08:36] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:11:26] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:11:29] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2038.codfw.wmnet
[16:11:31] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision
(exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:16:30] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1198105 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[16:16:34] !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[16:16:42] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2037.codfw.wmnet
[16:16:43] (PS3) Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - https://gerrit.wikimedia.org/r/1198104
[16:16:58] (CR) Majavah: [C:+2] P:toolforge::prometheus: Drop separate front proxy scrape target [puppet] - https://gerrit.wikimedia.org/r/1198105 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[16:19:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:36] (PS1) Cathal Mooney: sre.hosts.provision: add code to support Homer/Nokia to Dell section [cookbooks] - https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342)
[16:23:11] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:23:31] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:27:04] (CR) CI reject: [V:-1] sre.hosts.provision: add code to support Homer/Nokia to Dell section [cookbooks] - https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342)
(owner: Cathal Mooney)
[16:34:14] SRE, SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11299214 (Raine) In progress→Resolved a:Raine Done, @Lars let me know if anything isn't working :-)
[16:37:23] (CR) BCornwall: [V:+2 C:+2] varnish: Implement enable_m_redir and enable on test wikis [puppet] - https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) (owner: Krinkle)
[16:37:50] (CR) BCornwall: [C:+2] varnish: Enable enable_m_redir in Beta Cluster for all wikis [puppet] - https://gerrit.wikimedia.org/r/1197693 (https://phabricator.wikimedia.org/T405931) (owner: Krinkle)
[16:38:57] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2203.codfw.wmnet
[16:38:59] !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2203.codfw.wmnet
[16:39:11] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:40:20] !log kamila@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2203.codfw.wmnet with reason: host unresponsive
[16:43:25] cmooney@cumin1003 provision (PID 2858243) is awaiting input
[16:44:15] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:44:38] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:51:02] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2040.codfw.wmnet
[16:51:51] (CR) Btullis: [C:+2] Migrate refine_sanitize jobs to an-launcher1003 [puppet] -
10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:56:04] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11299270 (10Jclark-ctr) a:05BTullis→03None [16:56:10] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11299271 (10Jclark-ctr) a:03Jclark-ctr [16:56:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11299272 (10Papaul) @elukey i think the next step will be to try to install the OS without setting up the boot disk and let the OS take care of it. mayb... 
[16:57:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:58:18] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2039.codfw.wmnet [16:59:47] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1700) [17:01:02] 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004 (10Raine) 03NEW p:05Triage→03Low [17:03:44] (03CR) 10Jsn.sherman: [C:03+1] Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [17:09:56] (03CR) 10Dzahn: zookeeper: add support for TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:10:02] !log kamila@cumin1003 START - Cookbook sre.dns.netbox [17:11:03] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Check for same log_actor between local and global log entry [extensions/CentralAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198119 (https://phabricator.wikimedia.org/T398177) [17:11:11] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Check for same log_actor between local and global log entry [extensions/CentralAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198120 (https://phabricator.wikimedia.org/T398177) [17:11:11] (03CR) 10Dzahn: [C:03+1] gerrit: unmask service & disable backup temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [17:11:24] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CentralAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198119 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [17:11:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CentralAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198120 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [17:12:43] !log kamila@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:12:51] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie [17:14:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [17:15:58] 06SRE, 10SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299360 (10Raine) [17:16:40] 06SRE, 10SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299376 (10Raine) @mark can you please approve this from the SRE side? Thanks! 
[17:19:31] (03PS1) 10Dzahn: zuul: temporarily make zuul2002 use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1198123 [17:19:56] (03PS2) 10Dzahn: zuul: temporarily make zuul2002 use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1198123 [17:20:12] (03CR) 10Dzahn: [C:03+2] zuul: temporarily make zuul2002 use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1198123 (owner: 10Dzahn) [17:24:39] (03PS1) 10Dzahn: zuul::base: pass srange firewall parameter as an array [puppet] - 10https://gerrit.wikimedia.org/r/1198126 [17:24:58] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1197720 (https://phabricator.wikimedia.org/T407917) (owner: 10Dzahn) [17:25:34] 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299432 (10Raine) a:05Raine→03mark [17:26:36] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11299457 (10Raine) a:03KFrancis [17:27:39] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1198126/7370/" [puppet] - 10https://gerrit.wikimedia.org/r/1198126 (owner: 10Dzahn) [17:30:25] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2042.codfw.wmnet [17:30:26] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw and A:cp [17:31:15] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie [17:36:59] (03PS1) 10Dzahn: zuul::main: add firewall src sets CACHES to envoy Hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1198127 (https://phabricator.wikimedia.org/T395938) [17:37:57] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2041.codfw.wmnet [17:37:57] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot 
(exit_code=0) rolling reboot on A:cp-text_codfw and A:cp [17:41:47] I've been finding Gerrit really unreliable throughout today [17:42:09] Like connections being dropped entirely, and then only parts of the page loading [17:42:14] 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299634 (10Raine) confirmed key oob [17:42:17] Sometimes thinking I'm signed out entirely [17:42:24] but the next page load I am signed in [17:42:44] Is this known? [17:42:52] (03PS3) 10Krinkle: varnish: Remove unreachable optin=beta code [puppet] - 10https://gerrit.wikimedia.org/r/1197730 (https://phabricator.wikimedia.org/T405931) [17:43:11] (03PS6) 10Krinkle: varnish: Enable enable_m_redir in esams and drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1197694 (https://phabricator.wikimedia.org/T405931) [17:43:14] (03PS1) 10Dzahn: site: move zuul2002 to insetup role temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1198128 [17:43:16] (03PS10) 10Krinkle: varnish: Enable enable_m_redir everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1197695 (https://phabricator.wikimedia.org/T405931) [17:43:40] (03CR) 10Dzahn: [C:03+2] zuul::main: add firewall src sets CACHES to envoy Hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1198127 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:43:41] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197694 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [17:44:13] FIRING: SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [17:45:14] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11299639 (10Ladsgroup) 05Open→03Resolved I fully set up the VM now. Some automation is needed which I file a ticket for that later. 
[17:47:38] 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299651 (10Raine) [17:53:17] !log mwscript-k8s --dblist=small --follow -- purgeUserOptions.php --login-age 11 (T406724) [17:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:22] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [17:55:57] (03CR) 10Kamila Součková: [C:03+2] "key verified OOB" [puppet] - 10https://gerrit.wikimedia.org/r/1197720 (https://phabricator.wikimedia.org/T407917) (owner: 10Dzahn) [18:00:05] dancy and andre: That opportune time for a MediaWiki train - Utc-7+Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1800). [18:02:33] (03PS1) 10CDanis: haproxy: x-is-browser: --> Data Lake [puppet] - 10https://gerrit.wikimedia.org/r/1198130 [18:03:08] o/ [18:04:05] (03PS1) 10Kosta Harlan: Instrument the Suggested investigations feature [extensions/CheckUser] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198131 (https://phabricator.wikimedia.org/T404177) [18:04:28] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1198128/7372/" [puppet] - 10https://gerrit.wikimedia.org/r/1198128 (owner: 10Dzahn) [18:05:48] (03PS1) 10Ssingh: varnish: add conditional to varnish::common::vcl for beta [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) [18:06:57] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh) [18:07:14] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198133 
(https://phabricator.wikimedia.org/T405680) [18:07:16] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198133 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [18:08:08] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198133 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [18:08:49] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul2001.codfw.wmnet with reason: still in setup [18:09:30] !log deleting local user_password on sul wikis (T104500) [18:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:34] T104500: Old versions of sensitive user data (email, password hashes) can remain in database indefinitely due to local and global DB not being kept in sync - https://phabricator.wikimedia.org/T104500 [18:09:55] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: still in setup [18:11:29] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul2001.codfw.wmnet with reason: still in setup [18:12:48] (03PS3) 10Jcrespo: [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) [18:13:18] (03PS4) 10Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104 [18:14:40] 06SRE, 10SRE-Access-Requests: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008 (10Urbanecm) 03NEW [18:14:44] (03CR) 10Jcrespo: "I am doing a deeper refactor, but I am implementing essentially your solution here:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 
10Muehlenhoff) [18:16:27] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.24 refs T405680 [18:16:32] T405680: 1.45.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T405680 [18:17:11] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie [18:17:28] (03CR) 10Jcrespo: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:18:00] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie [18:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:24:20] (03CR) 10Ssingh: [V:03+1] "Original change I81ab37d461e0893d251fb9ad6026472b103b574c" [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh) [18:26:36] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:34] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [18:29:53] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards [18:29:58] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [18:31:21] 06SRE, 06Commons, 10MediaWiki-Special-pages: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009 (10Josve05a) 03NEW [18:31:38] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 221.94 ms [18:34:05] 06SRE, 06Commons, 10MediaWiki-Watchlist, 
06Moderator-Tools-Team: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299832 (10A_smart_kitten) [18:34:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:34:40] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards [18:34:42] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet, repooling both afterwards [18:36:19] 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299851 (10Josve05a) [18:37:47] 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299860 (10Josve05a) [18:38:33] 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299863 (10Josve05a) >>! In T408010#11299812, @Xaosflux wrote: > {F66781980} > > Able to... 
[18:38:50] 06SRE, 10SRE-Access-Requests: replace ssh keys with yubikey-backed key for Daniel Z - https://phabricator.wikimedia.org/T407917#11299865 (10Dzahn) a:03Dzahn [18:39:28] 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on different wikis throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299869 (10Josve05a) [18:39:28] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet, repooling both afterwards [18:39:32] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [18:39:49] 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist throws ‘InvalidArgumentException’ fatal error on multiple projects - https://phabricator.wikimedia.org/T408009#11299872 (10Xaosflux) [18:45:04] (03PS1) 10Bking: WIP: deploy a test OpenSearch cluster in opensearch-ipoid-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [18:48:14] (03CR) 10Ssingh: dnsrecursor: use config dir instead of standalone file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [18:49:36] 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist throws ‘InvalidArgumentException’ fatal error on multiple projects - https://phabricator.wikimedia.org/T408009#11299889 (10Zabe) →14Duplicate dup:03T407996 [18:51:23] (03CR) 10CDanis: [C:03+1] varnish: add conditional to varnish::common::vcl for beta [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh) [18:53:58] !log sudo cumin "A:cp" "disable-puppet 'merging CR 
1198132'" [18:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:30] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:01:28] (03CR) 10Ssingh: [V:03+1 C:03+2] varnish: add conditional to varnish::common::vcl for beta [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh) [19:02:55] (03CR) 10Dzahn: [C:03+1] "nothing is buster except puppetmasters and maps:" [puppet] - 10https://gerrit.wikimedia.org/r/1197334 (owner: 10Dzahn) [19:03:47] (03CR) 10Dzahn: [C:03+2] zookeeper: drop safety check for buster, no more buster [puppet] - 10https://gerrit.wikimedia.org/r/1197334 (owner: 10Dzahn) [19:06:49] !log sudo cumin "A:cp" "run-puppet-agent --enable 'merging CR 1198132'" [19:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:59] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:07:17] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:09:49] 06SRE, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11299948 (10amastilovic) [19:10:30] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:11:00] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [19:11:08] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test [19:11:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle) [19:24:15] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:30] (03PS1) 10Ebernhardson: cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141 [19:27:34] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:27:54] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:32:59] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:33:08] (03CR) 10Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [19:33:18] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:36:13] (03PS2) 10Ebernhardson: cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141 [19:38:28] 10ops-eqiad, 06SRE, 
10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11300023 (10VRiley-WMF) 05Open→03In progress Starting on ms-be1089 [19:38:30] (03PS14) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [19:40:18] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [19:42:04] (03CR) 10CDobbins: [V:03+1] dnsrecursor: use config dir instead of standalone file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [19:44:28] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [19:45:59] (03CR) 10Ebernhardson: [C:03+2] cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141 (owner: 10Ebernhardson) [19:47:42] (03Merged) 10jenkins-bot: cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141 (owner: 10Ebernhardson) [19:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:51:32] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:51:43] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:52:05] (03PS1) 10Kgraessle: Fix InvalidArgumentException in Watchlist [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996) [19:53:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996) (owner: 10Kgraessle) [19:54:50] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:55:05] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:59:54] 06SRE, 
10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11300144 (10KFrancis) The NDA is complete. Thanks! [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2000) [20:00:05] Krinkle and katherine_g: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet - https://phabricator.wikimedia.org/T407605#11300145 (10KFrancis) The NDA is complete. Thanks! [20:00:18] hi [20:00:37] (03CR) 10Brennen Bearnes: [C:03+1] "With https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/84 merged this should be fine." [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [20:00:45] hi [20:02:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle) [20:02:23] rolling out mine meanwhile [20:03:03] sounds good, I'll deploy after you [20:04:19] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:04:29] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:06:51] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11300161 (10VRiley-WMF) ms-be1089 is completed, moving onto the next server ms-be1090 [20:08:14] (03PS1) 10Dzahn: gerrit: set QoS to log_only [puppet] - 
10https://gerrit.wikimedia.org/r/1198148 (https://phabricator.wikimedia.org/T406774) [20:08:25] (03PS2) 10Cathal Mooney: sre.hosts.provision: move the switch config to parent class and run [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) [20:13:13] (03CR) 10Dzahn: [C:03+2] gerrit: set QoS to log_only [puppet] - 10https://gerrit.wikimedia.org/r/1198148 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn) [20:13:35] (03Merged) 10jenkins-bot: fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle) [20:14:09] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1197866|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] [20:14:14] T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403 [20:15:50] Krinkle: FWIW https://phabricator.wikimedia.org/T407403#11300176 (although I agree it's fine to just wait and see if anything breaks) [20:18:30] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1197866|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:19:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:20:12] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie [20:22:17] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:22:35] !log krinkle@deploy2002 krinkle: Continuing with sync [20:22:50] tgr_: thx [20:23:09] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:23:31] andrew@cumin2002 reimage (PID 4063959) is awaiting input [20:25:34] katherine_g: nearly done [20:26:47] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197866|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] (duration: 12m 38s) [20:26:52] T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403 [20:29:05] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11300199 (10Andrew) I don't know what a healthy grub run looks like, but I'm not loving this: ` Oct 22 19:46:10 grub-installer: info: Running chroot /target grub-install --for ce "/d... [20:29:23] katherine_g: all yours [20:29:28] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [20:29:30] krinkle: thanks [20:30:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996) (owner: 10Kgraessle) [20:30:39] (03PS18) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [20:30:59] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [20:31:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:31:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [20:31:40] FIRING: DiskSpace: Disk space ml-serve1012:9100:/ 4.769% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:34:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:35:12] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7396/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [20:35:25] RESOLVED: MirrorHighLag: Mirrors - 
/srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [20:36:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:36:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:40] RESOLVED: DiskSpace: Disk space ml-serve1012:9100:/ 4.8% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:39:40] FIRING: DiskSpace: Disk space ml-serve1012:9100:/ 4.767% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:41:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:41:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:04] (03CR) 10Herron: [V:03+1] "Thanks, this turned up in pcc as well and I forgot to upload before tagging you. Sorry for the false start, my bad! Sorted out now" [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [20:45:43] (03Merged) 10jenkins-bot: Fix InvalidArgumentException in Watchlist [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996) (owner: 10Kgraessle) [20:46:15] !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1198147|Fix InvalidArgumentException in Watchlist (T407996)]] [20:46:20] T407996: InvalidArgumentException: Unknown filter module "latest" - https://phabricator.wikimedia.org/T407996 [20:47:28] (03PS1) 10JHathaway: sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) [20:49:56] (03CR) 10CI reject: [V:04-1] sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [20:50:36] !log kgraessle@deploy2002 kgraessle: Backport for [[gerrit:1198147|Fix InvalidArgumentException in Watchlist (T407996)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:52:57] !log kgraessle@deploy2002 kgraessle: Continuing with sync [20:53:16] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:01] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:54:27] (03PS2) 10JHathaway: sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) [20:57:04] !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198147|Fix InvalidArgumentException in Watchlist (T407996)]] (duration: 10m 49s) [20:57:09] T407996: InvalidArgumentException: Unknown filter module "latest" - https://phabricator.wikimedia.org/T407996 [20:58:30] yay my watchlist now works again, thanks katherine_g :D [20:59:03] Josve05a: np :) [20:59:46] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2100) 
[21:02:17] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11300347 (10Andrew) Here is the equivalent for bookworm (which works): ` Oct 22 20:45:11 grub-installer: info: Running chroot /target grub-install --for ce "/dev/sdd" Oct 22 20:45:11... [21:04:01] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:30] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [21:09:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [21:13:07] (03PS3) 10JHathaway: sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) [21:13:21] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [21:20:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:25:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.638s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:30:25] (03PS1) 10Dzahn: gerrit: add 2 large prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) [21:30:38] (03CR) 10Scott French: "Thanks, Effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [21:31:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:31:48] (03PS2) 10Dzahn: gerrit: add 2 large prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) [21:33:52] (03CR) 10Scott French: [C:03+1] "Thanks, Effie! 
I believe you should be good to proceed with this patch, as long as you rebase it onto `production` **first** to decouple i" [puppet] - 10https://gerrit.wikimedia.org/r/1198035 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [21:37:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.3s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:37:47] (03CR) 10Dzahn: [C:03+2] gerrit: add 2 large prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn) [21:37:48] (03PS5) 10RLazarus: deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352 [21:37:48] (03PS3) 10RLazarus: deployment_server: Add --priority to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196989 (https://phabricator.wikimedia.org/T406212) [21:37:49] (03PS3) 10RLazarus: deployment_server: Add --dangerously_fast to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) [21:41:02] (03CR) 10Hashar: gerrit: add 2 large prefixes to abusers list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn) [21:41:20] (03CR) 10RLazarus: [C:03+2] deployment_server: Prefix `helmfile apply` output with "[service env]" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192282 (owner: 10RLazarus) [21:43:05] (03CR) 10RLazarus: [C:03+2] "Thanks @glavagetto@wikimedia.org for both reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1195352 (owner: 10RLazarus) [21:43:18] (03PS6) 10RLazarus: deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352 [21:44:15] FIRING: SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:45:51] (03CR) 10Bking: [C:03+1] Update the definition of @dse_kubepods_networks [puppet] - 10https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis) [21:46:32] (03CR) 10RLazarus: [C:03+2] deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352 (owner: 10RLazarus) [21:49:47] (03PS1) 10Dzahn: gerrit: adding a network to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198159 (https://phabricator.wikimedia.org/T408023) [21:53:05] (03CR) 10Dzahn: [C:03+2] gerrit: adding a network to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198159 (https://phabricator.wikimedia.org/T408023) (owner: 10Dzahn) [21:59:55] (03PS1) 10Dzahn: gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023) [22:00:03] andrew@cumin2002 reimage (PID 4083792) is awaiting input [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2200) [22:00:12] (03CR) 10CI reject: [V:04-1] gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023) (owner: 10Dzahn) [22:02:20] (03PS2) 10Dzahn: gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023) [22:02:31] (03CR) 10Dzahn: [C:03+2] gerrit: add 2 large prefixes to abusers list (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn) [22:02:49] (03PS5) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [22:03:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078'] [22:03:36] (03CR) 10Dzahn: [C:03+2] gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023) (owner: 10Dzahn) [22:03:38] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078'] [22:05:20] (03PS1) 10Ryan Kemper: (wip) wdqs: detect blazegraph deadlock [alerts] - 10https://gerrit.wikimedia.org/r/1198161 (https://phabricator.wikimedia.org/T389859) [22:06:42] !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2058'] [22:07:25] (03PS1) 10Reedy: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198162 (https://phabricator.wikimedia.org/T405235) [22:07:30] jouncebot: nowandnext [22:07:30] For the next 0 hour(s) and 52 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2200) [22:07:30] In 7 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0600) [22:07:30] In 7 hour(s) and 52 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0600) [22:07:38] (03CR) 10Reedy: [C:03+2] Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198162 
(https://phabricator.wikimedia.org/T405235) (owner: 10Reedy) [22:07:43] (03PS1) 10Reedy: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198163 (https://phabricator.wikimedia.org/T405235) [22:07:51] (03CR) 10Reedy: [C:03+2] Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198163 (https://phabricator.wikimedia.org/T405235) (owner: 10Reedy) [22:08:23] Reedy let me know when you are done. I have a couple of deployments i need to do [22:08:27] TIA [22:08:52] Jdlrobson: These are just maintenance scripts, so a noop for production [22:09:23] so okay for me to proceed? [22:09:30] or do you want to finish up what you are doing first? [22:10:01] You should be good to continue, those will take a little while to get through CI [22:10:32] ok thanks [22:11:00] (03PS3) 10Jdlrobson: [labs] Move namespaces to audience definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152) [22:11:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson) [22:12:28] (03Merged) 10jenkins-bot: [labs] Move namespaces to audience definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson) [22:12:48] (03PS2) 10Jdlrobson: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) [22:12:57] !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1002'] [22:13:53] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts 
['sretest1002'] [22:14:59] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2058'] [22:15:02] (03Merged) 10jenkins-bot: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198162 (https://phabricator.wikimedia.org/T405235) (owner: 10Reedy) [22:15:11] (03Merged) 10jenkins-bot: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198163 (https://phabricator.wikimedia.org/T405235) (owner: 10Reedy) [22:15:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078'] [22:15:18] !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [22:15:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078'] [22:15:43] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [22:17:08] (03PS3) 10Jdlrobson: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) [22:17:13] !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2058'] [22:17:17] (03CR) 10Jdlrobson: "Reivisi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) (owner: 10Jdlrobson) [22:17:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078'] [22:17:41] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2058'] [22:19:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 
using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) (owner: 10Jdlrobson) [22:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:19:54] (03Merged) 10jenkins-bot: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) (owner: 10Jdlrobson) [22:19:56] 06SRE, 10SRE-Access-Requests: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11300711 (10thcipriani) As noted in the description, @Urbanecm and I chatted, the rationale for access looks good to me. I approve! [22:20:26] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1194309|Enable QuickSurveys on all wikis (T317841)]] [22:20:30] T317841: Simplify QuickSurveys configuration by enabling everywhere - https://phabricator.wikimedia.org/T317841 [22:23:45] !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [22:24:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2078'] [22:24:39] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1194309|Enable QuickSurveys on all wikis (T317841)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:25:04] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078'] [22:25:18] !log T407057 - ran mwscript extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php --wiki=officewiki [22:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:22] T407057: Run MoveRecoveryCodesFromTOTP.php - https://phabricator.wikimedia.org/T407057 [22:25:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078'] [22:26:29] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [22:29:24] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003'] [22:30:38] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194309|Enable QuickSurveys on all wikis (T317841)]] (duration: 10m 12s) [22:30:43] T317841: Simplify QuickSurveys configuration by enabling everywhere - https://phabricator.wikimedia.org/T317841 [22:31:37] !log T407057 - ran foreachwikiindblist fishbowl.dblist extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php [22:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:41] T407057: Run MoveRecoveryCodesFromTOTP.php - https://phabricator.wikimedia.org/T407057 [22:32:24] !log T407057 - ran foreachwikiindblist private.dblist extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php [22:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078'] [22:33:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078'] [22:34:34] !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [22:35:01] Done! 
[22:35:09] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [22:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via next (k8s) 1.446s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:37:55] (03CR) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [22:41:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via next (k8s) 1.157s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:49:07] !log T407057 - ran mwscript extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php --wiki=metawiki [22:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:12] T407057: Run MoveRecoveryCodesFromTOTP.php - https://phabricator.wikimedia.org/T407057 [23:16:16] (03PS3) 10Reedy: CommonSettings-labs: Remove OATHAuth config that are the same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195191 (https://phabricator.wikimedia.org/T404807) [23:16:17] (03PS1) 10Tim Starling: recentchanges: QueryRateEstimator improvements [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198178 (https://phabricator.wikimedia.org/T403798) [23:16:20] (03CR) 10Reedy: [C:03+2] 
CommonSettings-labs: Remove OATHAuth config that are the same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195191 (https://phabricator.wikimedia.org/T404807) (owner: 10Reedy) [23:17:31] (03Merged) 10jenkins-bot: CommonSettings-labs: Remove OATHAuth config that are the same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195191 (https://phabricator.wikimedia.org/T404807) (owner: 10Reedy) [23:24:15] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:24:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11300946 (10Papaul) While trying to use the firmware upgrade cookbook with "sudo cookbook sre.hardware.upgrade-firmware ms-be2078 --new" i get the error... 
[23:25:09] (03PS1) 10Reedy: CommonSettings: Remove some OATHAuth config overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806) [23:25:27] (03CR) 10Reedy: [C:04-2] "Needs next weeks train to go through" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806) (owner: 10Reedy) [23:38:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198181 [23:38:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198181 (owner: 10TrainBranchBot) [23:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:52:21] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198181 (owner: 10TrainBranchBot)