[00:00:01] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1197366 (owner: TrainBranchBot)
[00:03:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1011.eqiad.wmnet
[00:03:29] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:04:40] <jinxer-wm>	 RESOLVED: DiskSpace: Disk space ml-serve1012:9100:/ 3.068% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:04:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:05:44] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[00:08:29] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:09:15] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:14:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:16:25] <sukhe>	 !log sudo ipmitool -I lanplus -H "cp3073.mgmt.esams.wmnet" -U root -E chassis power cycle
[00:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:20] <icinga-wm>	 RECOVERY - Host cp3073 is UP: PING OK - Packet loss = 0%, RTA = 80.00 ms
[00:22:36] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db-test1003.eqiad.wmnet with OS trixie
[00:22:40] <icinga-wm>	 PROBLEM - haproxy process on cp3073 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[00:23:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:23:39] <sukhe>	 cp3073 is depooled so no issues
[00:23:42] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[00:23:42] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[00:25:40] <icinga-wm>	 RECOVERY - haproxy process on cp3073 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[00:25:42] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3073 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2025-11-14 05:58:19 +0000 (expires in 23 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:25:42] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-01-07 23:02:02 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:28:10] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:31:21] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:33:10] <jinxer-wm>	 FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:34:28] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cp3073.esams.wmnet with reason: depooled
[00:38:10] <jinxer-wm>	 RESOLVED: [8x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:41:39] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[00:54:15] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:00:30] <wikibugs>	 SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11296745 (Andrew) With preseed-test I get different but also bad behavior. Grub works, but the kernel won't boot:   ` Loading Linux 6.12.43+deb13-amd64 ... Loading initial ramdisk ......
[01:05:45] <wikibugs>	 SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11296749 (Andrew) I really need a second config B (or at least 4-drive sw raid) prod server to test this on.
[01:29:15] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:31:32] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:34:26] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 4.462 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[02:19:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:14:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197722 (owner: Tim Starling)
[03:18:08] <wikibugs>	 (Merged) jenkins-bot: recentchanges: Temporary fix for incubator exception [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197722 (owner: Tim Starling)
[03:19:09] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1197722|recentchanges: Temporary fix for incubator exception]]
[03:23:34] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:23:42] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1197722|recentchanges: Temporary fix for incubator exception]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:24:24] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30040 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:24:36] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[03:28:47] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197722|recentchanges: Temporary fix for incubator exception]] (duration: 09m 38s)
[03:49:15] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:54:15] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:05:43] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[04:09:15] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:36] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:36:57] <fabfur>	 !log repooling cp3073 after reboot and removing downtime (T407110)
[04:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:14] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet
[04:37:30] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp3073.esams.wmnet
[04:37:30] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3073.esams.wmnet
[04:46:21] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:47:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:52:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:54:15] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:09:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:25:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:29:15] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:33:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] deployment_server: Prefix `helmfile apply` output with "[service env]" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192282 (owner: RLazarus)
[05:34:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:15] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:57] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] "LGTM, the comment can fully be ignored as it's a volans-like nitpick on coding style." [puppet] - https://gerrit.wikimedia.org/r/1195352 (owner: RLazarus)
[05:47:02] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297060 (Marostegui)
[05:52:00] <wikibugs>	 (PS1) Marostegui: mariadb: Add db1264 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197746 (https://phabricator.wikimedia.org/T407897)
[05:53:12] <wikibugs>	 (PS2) Marostegui: mariadb: Add db1264 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197746 (https://phabricator.wikimedia.org/T407897)
[05:56:04] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Add db1264 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197746 (https://phabricator.wikimedia.org/T407897) (owner: Marostegui)
[05:56:50] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, and 2 others: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297083 (Marostegui) >>! In T407897#11295732, @Jhancock.wm wrote: > @Marostegui could you or someone else on the team fill in the needed info for this task and make a...
[05:57:08] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, and 2 others: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297084 (Marostegui)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0600)
[06:01:34] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:03:32] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 8.941 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:04:50] <wikibugs>	 (PS1) Marostegui: mariadb: Add db2249 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197750 (https://phabricator.wikimedia.org/T407941)
[06:08:03] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Add db2249 to puppet [puppet] - https://gerrit.wikimedia.org/r/1197750 (https://phabricator.wikimedia.org/T407941) (owner: Marostegui)
[06:19:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:20:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:05] <wikibugs>	 (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1197067 (owner: L10n-bot)
[06:31:39] <wikibugs>	 (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1197247 (owner: L10n-bot)
[06:34:30] <wikibugs>	 (PS1) Marostegui: mariadb: Add db1265-db1298 to insetup [puppet] - https://gerrit.wikimedia.org/r/1197753 (https://phabricator.wikimedia.org/T405273)
[06:40:05] <wikibugs>	 SRE-SLO, observability, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11297139 (RKemper) >>! In T393966#11201576, @elukey wrote: > @Gehel @RKemper Hi! A while ago I had a chat wit...
[06:42:47] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Add db1265-db1298 to insetup [puppet] - https://gerrit.wikimedia.org/r/1197753 (https://phabricator.wikimedia.org/T405273) (owner: Marostegui)
[06:45:58] <jinxer-wm>	 FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:46:10] <marostegui>	 woot
[06:46:13] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[06:46:23] <marostegui>	 !incidents
[06:46:23] <sirenbot>	 6897 (UNACKED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:46:24] <sirenbot>	 6898 (UNACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:46:28] <marostegui>	 !ack 6897
[06:46:28] <sirenbot>	 6897 (ACKED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:46:29] <marostegui>	 !ack 6898
[06:46:30] <jinxer-wm>	 FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 5 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[06:46:30] <sirenbot>	 6898 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:46:34] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, procurement: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11297146 (Marostegui)
[06:50:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:51:04] <marostegui>	 !incidents
[06:51:05] <sirenbot>	 6897 (ACKED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:51:05] <sirenbot>	 6898 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:51:30] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 7 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[06:52:27] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1197672 (owner: Dpogorzelski)
[06:53:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:54:12] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:54:24] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, I'll let o11y folks vote though" [puppet] - https://gerrit.wikimedia.org/r/1197590 (https://phabricator.wikimedia.org/T407837) (owner: Majavah)
[06:54:27] <marostegui>	 !incidents
[06:54:27] <sirenbot>	 6897 (ACKED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[06:54:28] <sirenbot>	 6898 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:54:38] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] P:wmcs::metricsinfra: Fix thanos::rule usage [puppet] - https://gerrit.wikimedia.org/r/1197591 (https://phabricator.wikimedia.org/T407837) (owner: Majavah)
[06:55:00] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] cloudceph: set mtu only when interfaces exist [puppet] - https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478) (owner: Filippo Giunchedi)
[06:55:21] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1197696 (https://phabricator.wikimedia.org/T406927) (owner: Kamila Součková)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0700). nyaa~
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:22] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11297158 (elukey) @Dzahn Hello :) There is no need for apologies, I didn't take it in the bad way, what I was trying to convey is tha...
[07:01:36] <wikibugs>	 (PS1) Krinkle: fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403)
[07:03:43] <wikibugs>	 (CR) Elukey: [C:+2] profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos [puppet] - https://gerrit.wikimedia.org/r/1197602 (https://phabricator.wikimedia.org/T403697) (owner: Elukey)
[07:03:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:04:12] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:58] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:06:42] <wikibugs>	 (PS13) Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387)
[07:09:07] <wikibugs>	 (PS1) Jelto: aptrepo: update gitlab-ce and gitlab-runner to 18.3 [puppet] - https://gerrit.wikimedia.org/r/1197909 (https://phabricator.wikimedia.org/T407943)
[07:10:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:11:30] <jinxer-wm>	 RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[07:16:13] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[07:17:30] <wikibugs>	 (CR) Jelto: [C:+2] aptrepo: update gitlab-ce and gitlab-runner to 18.3 [puppet] - https://gerrit.wikimedia.org/r/1197909 (https://phabricator.wikimedia.org/T407943) (owner: Jelto)
[07:17:44] <wikibugs>	 (CR) Elukey: [C:+2] Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[07:18:09] <wikibugs>	 (PS1) Marostegui: db1264: Add 1P note [puppet] - https://gerrit.wikimedia.org/r/1197924
[07:18:58] <wikibugs>	 (CR) Marostegui: [C:+2] db1264: Add 1P note [puppet] - https://gerrit.wikimedia.org/r/1197924 (owner: Marostegui)
[07:19:10] <marostegui>	 elukey: ok to merge?
[07:19:12] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:18] <elukey>	 marostegui: yep!
[07:19:27] <marostegui>	 doing it now
[07:21:14] <wikibugs>	 (CR) Volans: deployment_server: Refactor charlie to add a Service dataclass (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1195352 (owner: RLazarus)
[07:21:36] <wikibugs>	 (PS1) Marostegui: db1262.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197926 (https://phabricator.wikimedia.org/T406550)
[07:22:29] <wikibugs>	 (CR) Marostegui: [C:+2] db1262.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197926 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[07:24:13] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:13] <hashar>	 jouncebot: now
[07:29:14] <jouncebot>	 For the next 0 hour(s) and 30 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0700)
[07:30:28] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr2-drmrs:9804) - https://phabricator.wikimedia.org/T407945 (LSobanski) NEW
[07:30:49] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407946 (LSobanski) NEW
[07:31:28] <wikibugs>	 SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11297227 (fgiunchedi) FWIW yesterday while testing the preseed-test fix for many drives (https://gitlab.wikimedia.org/repos/sre/preseed-test/-/merge_requests/6) I was able to install...
[07:31:45] <wikibugs>	 (PS3) Dpogorzelski: feat: add dpogorzelski user [puppet] - https://gerrit.wikimedia.org/r/1197672
[07:44:08] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr2-drmrs:9804) - https://phabricator.wikimedia.org/T407945#11297243 (cmooney) Open→Resolved a:cmooney There are other peers to that ASN, these not establishing.  Removed.
[07:44:55] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407946#11297248 (cmooney) Open→Resolved a:cmooney There are other sessions to that ASN but they have not configured these two....
[07:49:15] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:49:25] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1262 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197977 (https://phabricator.wikimedia.org/T406550)
[07:49:58] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add db1262 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197977 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[07:50:14] <wikibugs>	 (PS1) Dpogorzelski: chore: add dpogorzelski to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1197978
[07:52:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1262 depooled T406550', diff saved to https://phabricator.wikimedia.org/P84211 and previous config saved to /var/cache/conftool/dbconfig/20251022-075234-marostegui.json
[07:52:39] <stashbot>	 T406550: Productionize  db126[0-3] - https://phabricator.wikimedia.org/T406550
[07:54:15] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:55:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84212 and previous config saved to /var/cache/conftool/dbconfig/20251022-075508-root.json
[07:55:43] <jinxer-wm>	 RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:57:07] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: Seanleong-wmde)
[07:57:58] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005
[07:58:05] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest1005
[07:59:02] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005
[07:59:02] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005
[07:59:31] <wikibugs>	 (PS1) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[08:00:05] <jouncebot>	 jelto and hashar: Deploy window Gerrit server reboot (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0800)
[08:00:05] <jouncebot>	 dancy and andre: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T0800).
[08:00:20] <hashar>	 jelto: I am around :)
[08:00:42] <wikibugs>	 (PS2) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[08:00:45] <jelto>	 I'm also around, we can coordinate here. Or do you prefer the meet session?
[08:01:08] <hashar>	 meet will work for me as well :)
[08:01:16] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission es1029 [puppet] - https://gerrit.wikimedia.org/r/1197980 (https://phabricator.wikimedia.org/T407832)
[08:02:17] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1029.eqiad.wmnet
[08:02:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005
[08:02:34] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005
[08:04:29] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Decommission es1029 [puppet] - https://gerrit.wikimedia.org/r/1197980 (https://phabricator.wikimedia.org/T407832) (owner: Marostegui)
[08:08:02] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gerrit1003.wikimedia.org
[08:08:41] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[08:09:15] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84213 and previous config saved to /var/cache/conftool/dbconfig/20251022-081014-root.json
[08:13:49] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[08:14:10] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[08:14:10] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:14:11] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts es1029.eqiad.wmnet
[08:14:12] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:14:13] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:58] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11297337 (Marostegui) #dc-ops this host is ready. However the host is still UP due to an IPMI connection failure, but the rest of things have been done and you can proceed to...
[08:15:06] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11297341 (Marostegui)
[08:17:07] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1003.wikimedia.org
[08:19:12] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:13] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:18] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish: add XCHS based browser detection routine [puppet] - https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826)
[08:20:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:10] <wikibugs>	 (CR) Michael Große: [C:+1] fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle)
[08:25:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 7%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84214 and previous config saved to /var/cache/conftool/dbconfig/20251022-082521-root.json
[08:28:25] <wikibugs>	 (PS2) Giuseppe Lavagetto: varnish: add XCHS based browser detection routine [puppet] - https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826)
[08:31:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1053 to es2 primary as es1030 will be decommissioned T406690 T407953', diff saved to https://phabricator.wikimedia.org/P84215 and previous config saved to /var/cache/conftool/dbconfig/20251022-083134-marostegui.json
[08:31:41] <stashbot>	 T406690: Decommission es1026 - es1034 - https://phabricator.wikimedia.org/T406690
[08:31:41] <stashbot>	 T407953: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953
[08:31:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1030 T407953', diff saved to https://phabricator.wikimedia.org/P84216 and previous config saved to /var/cache/conftool/dbconfig/20251022-083153-marostegui.json
[08:32:55] <wikibugs>	 (PS1) Marostegui: es1030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1197990 (https://phabricator.wikimedia.org/T407953)
[08:33:26] <wikibugs>	 (CR) Marostegui: [C:+2] es1030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1197990 (https://phabricator.wikimedia.org/T407953) (owner: Marostegui)
[08:36:19] <wikibugs>	 (PS1) Marostegui: installserver: Remove es1052 [puppet] - https://gerrit.wikimedia.org/r/1197993
[08:36:36] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:38:30] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Remove es1052 [puppet] - https://gerrit.wikimedia.org/r/1197993 (owner: Marostegui)
[08:38:34] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 6.439 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:40:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84217 and previous config saved to /var/cache/conftool/dbconfig/20251022-084027-root.json
[08:40:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:41:02] <wikibugs>	 (CR) Dpogorzelski: [C:+1] feat: add dpogorzelski user [puppet] - https://gerrit.wikimedia.org/r/1197672 (owner: Dpogorzelski)
[08:47:45] <wikibugs>	 (CR) Klausman: [C:+2] feat: add dpogorzelski user [puppet] - https://gerrit.wikimedia.org/r/1197672 (owner: Dpogorzelski)
[08:48:30] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Add Nokia devices to common.yaml [homer/public] - https://gerrit.wikimedia.org/r/1196704 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[08:49:10] <wikibugs>	 (CR) Vgutierrez: [C:+1] "PCC looks good: https://puppet-compiler.wmflabs.org/output/1197986/7366/" [puppet] - https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[08:54:05] <wikibugs>	 (CR) Klausman: [C:+1] chore: add dpogorzelski to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1197978 (owner: Dpogorzelski)
[08:54:09] <wikibugs>	 (PS3) Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859)
[08:54:15] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:55:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 20%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84218 and previous config saved to /var/cache/conftool/dbconfig/20251022-085533-root.json
[08:56:04] <wikibugs>	 (PS4) Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859)
[08:59:20] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1197978 (owner: Dpogorzelski)
[08:59:38] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:00:55] <wikibugs>	 (CR) Vgutierrez: [C:+2] haproxy: Deploy private data files and set lua-prepend-path [puppet] - https://gerrit.wikimedia.org/r/1197681 (owner: Vgutierrez)
[09:01:13] <wikibugs>	 (CR) Klausman: [C:+2] chore: add dpogorzelski to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1197978 (owner: Dpogorzelski)
[09:01:30] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30036 bytes in 0.374 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:01:42] <wikibugs>	 (CR) Slyngshede: [C:+1] preseed.yaml: expand regex for sretest100x to include 1005/1006 [puppet] - https://gerrit.wikimedia.org/r/1197678 (https://phabricator.wikimedia.org/T405560) (owner: Cathal Mooney)
[09:02:52] <wikibugs>	 (PS1) Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577)
[09:04:32] <wikibugs>	 (CR) CI reject: [V:-1] Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:04:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Reduce weight for db2245 - which was wrong', diff saved to https://phabricator.wikimedia.org/P84219 and previous config saved to /var/cache/conftool/dbconfig/20251022-090437-marostegui.json
[09:06:45] <wikibugs>	 (CR) Btullis: "You will need to bump the chart version, since this is not an override in the helm values." [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[09:09:39] <wikibugs>	 (PS3) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[09:09:51] <wikibugs>	 (PS2) Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577)
[09:10:06] <wikibugs>	 (CR) Cathal Mooney: [C:+2] preseed.yaml: expand regex for sretest100x to include 1005/1006 [puppet] - https://gerrit.wikimedia.org/r/1197678 (https://phabricator.wikimedia.org/T405560) (owner: Cathal Mooney)
[09:10:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84220 and previous config saved to /var/cache/conftool/dbconfig/20251022-091039-root.json
[09:11:09] <wikibugs>	 (CR) CI reject: [V:-1] Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:12:06] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11297474 (elukey) @Papaul the issue comes before debian and partman, because when I try to provision the host there is no "hard-disk" option to put as...
[09:12:22] <wikibugs>	 (CR) Btullis: superset: Increase the nginx proxy timeout (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[09:13:44] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] k8s/client_cert: adjust Prometheus certificate renewal timing [puppet] - https://gerrit.wikimedia.org/r/1197303 (https://phabricator.wikimedia.org/T407484) (owner: Tiziano Fogli)
[09:14:38] <wikibugs>	 (CR) Federico Ceratto: [C:+2] preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[09:15:37] <wikibugs>	 (PS1) David Caro: dcaro: remove unused old ssh key [puppet] - https://gerrit.wikimedia.org/r/1198002
[09:16:26] <federico3>	 tappof: can I puppet-merge your pending change "Temporarily longer client certs - https://phabricator.wikimedia.org/T343529"?
[09:16:40] <tappof>	 yes federico3, thx
[09:18:24] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1198002 (owner: David Caro)
[09:20:08] <wikibugs>	 (PS4) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[09:20:33] <wikibugs>	 (PS1) Vgutierrez: Revert "chore: add dpogorzelski to ops-limited" [puppet] - https://gerrit.wikimedia.org/r/1198003
[09:21:28] <wikibugs>	 (CR) David Caro: [C:+2] dcaro: remove unused old ssh key [puppet] - https://gerrit.wikimedia.org/r/1198002 (owner: David Caro)
[09:21:52] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert "chore: add dpogorzelski to ops-limited" [puppet] - https://gerrit.wikimedia.org/r/1198003 (owner: Vgutierrez)
[09:22:24] <vgutierrez>	 dcaro: merge mine if it's showing on your puppet-merge session
[09:22:42] <dcaro>	 vgutierrez: it did not, almost finished
[09:22:47] <vgutierrez>	 thx
[09:23:00] <dcaro>	 you can go now :)
[09:23:25] <wikibugs>	 (PS3) Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577)
[09:23:55] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove es1030 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1198004 (https://phabricator.wikimedia.org/T407953)
[09:25:07] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2057.codfw.wmnet with reason: Setting up new ES host
[09:25:40] <wikibugs>	 (PS1) MVernon: swift: remove ms-be10{89,90} for controller swap [puppet] - https://gerrit.wikimedia.org/r/1198005 (https://phabricator.wikimedia.org/T400877)
[09:25:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 30%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84221 and previous config saved to /var/cache/conftool/dbconfig/20251022-092545-root.json
[09:26:20] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Remove es1030 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1198004 (https://phabricator.wikimedia.org/T407953) (owner: Marostegui)
[09:27:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1030 from dbctl T407953', diff saved to https://phabricator.wikimedia.org/P84222 and previous config saved to /var/cache/conftool/dbconfig/20251022-092747-marostegui.json
[09:27:52] <stashbot>	 T407953: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953
[09:28:39] <wikibugs>	 (PS4) Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577)
[09:30:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:31:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955 (DPogorzelski-WMF) NEW
[09:32:21] <wikibugs>	 (PS1) Marostegui: db1251: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1198006 (https://phabricator.wikimedia.org/T407463)
[09:32:57] <wikibugs>	 (CR) Marostegui: [C:+2] db1251: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1198006 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[09:34:10] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1251.eqiad.wmnet with reason: Maintenance
[09:34:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1251 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84223 and previous config saved to /var/cache/conftool/dbconfig/20251022-093413-marostegui.json
[09:35:12] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: Ayounsi)
[09:37:06] <wikibugs>	 (PS2) Ayounsi: Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146)
[09:38:43] <wikibugs>	 (CR) Hnowlan: [C:+2] "Yeah, these are not publicly routed. The rules were added before the inconsistency between GET and POST behaviour in the corresponding res" [deployment-charts] - https://gerrit.wikimedia.org/r/1112797 (https://phabricator.wikimedia.org/T384216) (owner: Hnowlan)
[09:39:34] <wikibugs>	 (PS1) Cathal Mooney: config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - https://gerrit.wikimedia.org/r/1198008
[09:40:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84224 and previous config saved to /var/cache/conftool/dbconfig/20251022-094051-root.json
[09:41:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:42:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84225 and previous config saved to /var/cache/conftool/dbconfig/20251022-094213-root.json
[09:47:59] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone_es of es2034.codfw.wmnet onto es2057.codfw.wmnet
[09:48:00] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005
[09:48:04] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2034 - Depool es2034.codfw.wmnet to then clone it to es2057.codfw.wmnet - fceratto@cumin1003
[09:48:09] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005
[09:48:21] <wikibugs>	 (PS5) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[09:48:22] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2034 - Depool es2034.codfw.wmnet to then clone it to es2057.codfw.wmnet - fceratto@cumin1003
[09:48:39] <wikibugs>	 (CR) Vgutierrez: [C:+2] varnish: add XCHS based browser detection routine [puppet] - https://gerrit.wikimedia.org/r/1197986 (https://phabricator.wikimedia.org/T404826) (owner: Giuseppe Lavagetto)
[09:50:12] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on es1030.eqiad.wmnet with reason: Decommissioning
[09:50:12] <marostegui>	 !log Stop mariadb on es1030 for decommissioning T407953
[09:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:19] <stashbot>	 T407953: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953
[09:51:22] <logmsgbot>	 fceratto@cumin1003 clone_es (PID 2441330) is awaiting input
[09:51:48] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005
[09:51:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:52:09] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005
[09:52:48] <wikibugs>	 (CR) Jcrespo: [C:+1] swift: remove ms-be10{89,90} for controller swap [puppet] - https://gerrit.wikimedia.org/r/1198005 (https://phabricator.wikimedia.org/T400877) (owner: MVernon)
[09:53:58] <wikibugs>	 (CR) Btullis: superset: Increase the nginx proxy timeout (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[09:55:40] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:55:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 60%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84228 and previous config saved to /var/cache/conftool/dbconfig/20251022-095557-root.json
[09:56:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:57:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84229 and previous config saved to /var/cache/conftool/dbconfig/20251022-095719-root.json
[09:57:35] <wikibugs>	 (PS1) Marostegui: db1263: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1198009 (https://phabricator.wikimedia.org/T406550)
[09:58:32] <wikibugs>	 (PS6) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[09:58:46] <logmsgbot>	 cmooney@cumin1003 provision (PID 2450330) is awaiting input
[09:58:58] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:59:16] <wikibugs>	 (CR) Marostegui: [C:+2] db1263: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1198009 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[09:59:51] <wikibugs>	 (PS1) Elukey: profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1000)
[10:01:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:02:12] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] Set Alias entity usage modifier limit to 10. [mediawiki-config] - https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: Seanleong-wmde)
[10:05:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:06:17] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1263 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1198012 (https://phabricator.wikimedia.org/T406550)
[10:07:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:07:18] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add db1263 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1198012 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[10:08:04] <wikibugs>	 (CR) MVernon: [C:+2] swift: remove ms-be10{89,90} for controller swap [puppet] - https://gerrit.wikimedia.org/r/1198005 (https://phabricator.wikimedia.org/T400877) (owner: MVernon)
[10:08:48] <wikibugs>	 (CR) Kamila Součková: [C:+2] Record LDAP access for lsandergreen. [puppet] - https://gerrit.wikimedia.org/r/1197696 (https://phabricator.wikimedia.org/T406927) (owner: Kamila Součková)
[10:09:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1263 to dbctl depooled T406550', diff saved to https://phabricator.wikimedia.org/P84230 and previous config saved to /var/cache/conftool/dbconfig/20251022-100920-marostegui.json
[10:09:26] <stashbot>	 T406550: Productionize  db126[0-3] - https://phabricator.wikimedia.org/T406550
[10:10:07] <wikibugs>	 (CR) Kamila Součková: [C:+1] admin: add yubikey ed25519-sk ssh key to user dzahn [puppet] - https://gerrit.wikimedia.org/r/1197720 (https://phabricator.wikimedia.org/T407917) (owner: Dzahn)
[10:11:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84231 and previous config saved to /var/cache/conftool/dbconfig/20251022-101103-root.json
[10:11:37] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] url_downloader: remove hcaptcha proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh)
[10:11:43] <wikibugs>	 (03PS3) 10Effie Mouzeli: etcd::tlsproxy: Remove testserver ACLs 2 [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498)
[10:12:06] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:12:11] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli)
[10:12:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84232 and previous config saved to /var/cache/conftool/dbconfig/20251022-101225-root.json
[10:12:28] <wikibugs>	 (03PS3) 10Effie Mouzeli: conftool-data: remove  testservers 3 [puppet] - 10https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498)
[10:14:44] <wikibugs>	 (03PS7) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[10:14:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:16:14] <wikibugs>	 (03CR) 10Btullis: superset: Increase the nginx proxy timeout (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene)
[10:17:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:19:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:19:50] <wikibugs>	 (03PS8) 10Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[10:22:06] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:24:44] <wikibugs>	 (03PS28) 10Clément Goubert: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler)
[10:25:45] <wikibugs>	 (03PS4) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490)
[10:26:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1262 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84233 and previous config saved to /var/cache/conftool/dbconfig/20251022-102609-root.json
[10:27:07] <wikibugs>	 (03PS2) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737)
[10:27:29] <wikibugs>	 (03CR) 10Phuedx: Add config for xLab MW Module experiment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming)
[10:27:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1251 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84234 and previous config saved to /var/cache/conftool/dbconfig/20251022-102732-root.json
[10:27:32] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming)
[10:27:46] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic (T405631)
[10:28:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:28:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: cfssl-ocsprefresh-wikikube_staging.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:37] <wikibugs>	 (03PS2) 10Elukey: profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869)
[10:29:11] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic (T405631)
[10:31:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:31:13] <wikibugs>	 (03CR) 10Elukey: "Left a comment related to the numerator metrics, lemme know :)" [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey)
[10:31:18] <wikibugs>	 (03PS1) 10Marco Fossati: Deploy the ReaderExperiments extension to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907)
[10:31:38] <wikibugs>	 (03PS5) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490)
[10:32:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: 10Marco Fossati)
[10:35:02] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:35:02] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1000)
[10:35:02] <jouncebot>	 In 0 hour(s) and 24 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1100)
[10:35:11] <Dreamy_Jazz>	 Anyone using this window?
[10:35:49] <wikibugs>	 (03PS1) 10Dreamy Jazz: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198017 (https://phabricator.wikimedia.org/T400280)
[10:35:57] <wikibugs>	 (03PS1) 10Dreamy Jazz: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198018 (https://phabricator.wikimedia.org/T400280)
[10:38:19] <wikibugs>	 (03PS4) 10Effie Mouzeli: conftool-data: remove  testservers 3 [puppet] - 10https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498)
[10:38:20] <wikibugs>	 (03PS1) 10Effie Mouzeli: scap: remove testservers 4 [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498)
[10:38:41] <Dreamy_Jazz>	 Going to proceed with a deploy
[10:38:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/IPInfo] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198017 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz)
[10:38:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/IPInfo] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198018 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz)
[10:39:09] <Dreamy_Jazz>	 I should be able to abort that if someone needs scap for the window in the next few mins
[10:39:53] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T405631)
[10:40:25] <wikibugs>	 (03CR) 10Stevemunene: "> You will need to bump the chart version, since this is not an override in the helm values." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene)
[10:40:26] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T405631)
[10:40:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] scap: remove testservers 4 [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli)
[10:41:01] <wikibugs>	 (03PS2) 10Effie Mouzeli: scap: remove testservers 4 [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498)
[10:43:11] <wikibugs>	 (03CR) 10Elukey: [C:03+1] config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198008 (owner: 10Cathal Mooney)
[10:44:14] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[10:45:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Decrease db1262 weight', diff saved to https://phabricator.wikimedia.org/P84235 and previous config saved to /var/cache/conftool/dbconfig/20251022-104530-marostegui.json
[10:46:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Decrease es2028 weight', diff saved to https://phabricator.wikimedia.org/P84236 and previous config saved to /var/cache/conftool/dbconfig/20251022-104601-marostegui.json
[10:47:25] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84237 and previous config saved to /var/cache/conftool/dbconfig/20251022-104724-root.json
[10:48:18] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T405631)
[10:48:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:49:01] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T405631)
[10:50:55] <wikibugs>	 (03PS1) 10Marostegui: db2146: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198020 (https://phabricator.wikimedia.org/T407463)
[10:50:57] <wikibugs>	 (03Merged) 10jenkins-bot: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198017 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz)
[10:51:35] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2146: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198020 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui)
[10:51:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:52:14] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198008 (owner: 10Cathal Mooney)
[10:52:51] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2146.codfw.wmnet with reason: Maintenance
[10:52:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2146 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84238 and previous config saved to /var/cache/conftool/dbconfig/20251022-105255-marostegui.json
[10:54:01] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[10:54:22] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[10:57:45] <wikibugs>	 (03Merged) 10jenkins-bot: Fix abuse_filter_log index in TempUserIPLookup [extensions/IPInfo] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198018 (https://phabricator.wikimedia.org/T400280) (owner: 10Dreamy Jazz)
[10:57:47] <wikibugs>	 (03Merged) 10jenkins-bot: config_switch_interfaces: force homer usage if switch is a Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198008 (owner: 10Cathal Mooney)
[10:58:10] <wikibugs>	 (03CR) 10Clément Goubert: "Last patch was a rebase only patch, restoring the +1 from @hnowlan@wikimedia.org" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler)
[10:58:24] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler)
[10:58:32] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1198017|Fix abuse_filter_log index in TempUserIPLookup (T400280)]], [[gerrit:1198018|Fix abuse_filter_log index in TempUserIPLookup (T400280)]]
[10:58:37] <stashbot>	 T400280: Drop `afl_ip` as the last step of the migration to `afl_ip_hex` - https://phabricator.wikimedia.org/T400280
[10:58:52] <wikibugs>	 (03CR) 10Hnowlan: "lgtm. some musings/nits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert)
[10:59:53] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler)
[10:59:56] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] etcd::tlsproxy: Remove testserver ACLs 2 [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli)
[11:00:04] <jouncebot>	 mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1100).
[11:00:19] <wikibugs>	 (03PS1) 10Kosta Harlan: EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177)
[11:00:48] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] conftool-data: remove  testservers 3 [puppet] - 10https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli)
[11:01:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84239 and previous config saved to /var/cache/conftool/dbconfig/20251022-110111-root.json
[11:02:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84240 and previous config saved to /var/cache/conftool/dbconfig/20251022-110230-root.json
[11:03:01] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1198017|Fix abuse_filter_log index in TempUserIPLookup (T400280)]], [[gerrit:1198018|Fix abuse_filter_log index in TempUserIPLookup (T400280)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:03:13] <wikibugs>	 (03CR) 10Clément Goubert: rest-gateway: Deploy rate limiting in staging (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert)
[11:04:06] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[11:04:23] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[11:04:40] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[11:05:39] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:05:44] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:06:32] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2203.codfw.wmnet
[11:07:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[11:08:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[11:08:34] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198017|Fix abuse_filter_log index in TempUserIPLookup (T400280)]], [[gerrit:1198018|Fix abuse_filter_log index in TempUserIPLookup (T400280)]] (duration: 10m 01s)
[11:08:38] <stashbot>	 T400280: Drop `afl_ip` as the last step of the migration to `afl_ip_hex` - https://phabricator.wikimedia.org/T400280
[11:08:48] <Dreamy_Jazz>	 I'm done with my deploy
[11:09:11] <logmsgbot>	 !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker2203.codfw.wmnet
[11:09:28] <wikibugs>	 (03PS2) 10Michael Große: beta: Enable ReviseTone Structured Task on enwiki,frwiki,arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176)
[11:10:29] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196638 (owner: 10PipelineBot)
[11:10:53] <wikibugs>	 (03PS9) 10Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943)
[11:12:02] <wikibugs>	 (03PS2) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672
[11:12:20] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196638 (owner: 10PipelineBot)
[11:12:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:37] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] "Thanks, didn't notice this part was needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan)
[11:12:43] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[11:12:58] <Dreamy_Jazz>	 jouncebot: nowandnext
[11:12:58] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1100)
[11:12:58] <jouncebot>	 In 1 hour(s) and 47 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1300)
[11:13:14] <Dreamy_Jazz>	 Anyone using this window too?
[11:13:21] <Dreamy_Jazz>	 Got another backport, but should be shorter
[11:14:06] <Mvolz>	 I was planning to use it, but it shouldn't interfere with a mediawiki deploy since it's k8s
[11:14:17] <Dreamy_Jazz>	 Okay, thanks
[11:14:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[11:14:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan)
[11:14:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[11:14:59] <claime>	 I'm deploying some api/rest gateway patches but shouldn't interfere either
[11:15:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[11:15:24] <Dreamy_Jazz>	 Thanks, this one should go faster
[11:15:38] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198021 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan)
[11:15:44] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:15:49] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:15:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[11:16:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:16:11] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1198021|EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction (T404177)]]
[11:16:16] <stashbot>	 T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177
[11:16:18] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84241 and previous config saved to /var/cache/conftool/dbconfig/20251022-111617-root.json
[11:16:34] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1006
[11:16:39] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1006
[11:17:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:17:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 7%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84242 and previous config saved to /var/cache/conftool/dbconfig/20251022-111736-root.json
[11:18:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:18:33] <wikibugs>	 (03PS4) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672 (owner: 10PipelineBot)
[11:19:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:20:16] <wikibugs>	 (03PS1) 10Cathal Mooney: sre.hosts.provision: adjust to always use Homer to config Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198026
[11:20:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:20:24] <logmsgbot>	 !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Backport for [[gerrit:1198021|EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction (T404177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:20:53] <logmsgbot>	 !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Continuing with sync
[11:21:50] <wikibugs>	 (03CR) 10Kamila Součková: [C:04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine)
[11:22:08] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] rest-gateway: Deploy rate limiting in staging (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert)
[11:22:19] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672 (owner: 10PipelineBot)
[11:24:10] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672 (owner: 10PipelineBot)
[11:24:15] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:24] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[11:24:53] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:25:00] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198021|EventStreamConfig: Don't collect user-agent for suggested_investigations_interaction (T404177)]] (duration: 08m 48s)
[11:25:04] <stashbot>	 T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177
[11:25:07] <Dreamy_Jazz>	 I'm done
[11:25:08] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1006
[11:25:13] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1006
[11:25:15] <wikibugs>	 (03PS1) 10Marostegui: db1196: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198027 (https://phabricator.wikimedia.org/T407463)
[11:25:18] <wikibugs>	 (03Abandoned) 10Kamila Součková: proxoid: add discovery SAN [puppet] - 10https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) (owner: 10Kamila Součková)
[11:26:07] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Upgrading
[11:26:31] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1196: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1198027 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui)
[11:26:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:26:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:27:28] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[11:27:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1196 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84243 and previous config saved to /var/cache/conftool/dbconfig/20251022-112732-marostegui.json
[11:27:49] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[11:28:07] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:28:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: cfssl-ocsprefresh-wikikube_staging.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:30] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:30:02] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:30:31] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:30:56] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:31:24] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84244 and previous config saved to /var/cache/conftool/dbconfig/20251022-113123-root.json
[11:32:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84245 and previous config saved to /var/cache/conftool/dbconfig/20251022-113243-root.json
[11:35:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84246 and previous config saved to /var/cache/conftool/dbconfig/20251022-113521-root.json
[11:37:51] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11297888 (10MatthewVernon)
[11:40:10] <logmsgbot>	 !log mvernon@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ms-be[1089-1090].eqiad.wmnet with reason: awaiting controller swap
[11:40:18] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11297900 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cea00150-47a1-46ce-a142-ec46d9e47678) set by mvernon@cumin1003 for 3 days, 0:...
[11:40:21] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11297901 (10MatthewVernon) @VRiley-WMF the last two nodes ms-be1089 and ms-be1090 are ready for controller swap, please; I've downtimed them for a couple...
[11:42:13] <wikibugs>	 (CR) Cathal Mooney: [C:+2] gnmic: add collection for Nokia OSPF states [puppet] - https://gerrit.wikimedia.org/r/1196714 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[11:42:17] <wikibugs>	 (PS2) Phuedx: EventStreamConfig: Remove mediawiki.reference_previews stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127)
[11:43:37] <wikibugs>	 (PS3) Cathal Mooney: gnmic: Adjust BGP collection for Nokia compatibility [puppet] - https://gerrit.wikimedia.org/r/1196917 (https://phabricator.wikimedia.org/T405558)
[11:45:21] <wikibugs>	 (PS1) Phuedx: EventStreamConfig: Remove wikibase.client.interaction stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1198029 (https://phabricator.wikimedia.org/T370045)
[11:46:30] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84247 and previous config saved to /var/cache/conftool/dbconfig/20251022-114629-root.json
[11:46:40] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:47:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 20%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84248 and previous config saved to /var/cache/conftool/dbconfig/20251022-114749-root.json
[11:48:13] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:48:30] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[11:48:56] <wikibugs>	 (CR) Cathal Mooney: [C:+2] gnmic: Adjust BGP collection for Nokia compatibility [puppet] - https://gerrit.wikimedia.org/r/1196917 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[11:49:15] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:50:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84249 and previous config saved to /var/cache/conftool/dbconfig/20251022-115027-root.json
[11:56:32] <wikibugs>	 (CR) Btullis: [C:+1] superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[11:56:46] <wikibugs>	 (CR) Btullis: [C:+1] superset: Increase the nginx proxy timeout (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: Stevemunene)
[11:58:30] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:00:48] <logmsgbot>	 jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[12:02:38] <wikibugs>	 (PS1) Marostegui: db1184: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1198030 (https://phabricator.wikimedia.org/T407463)
[12:02:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84251 and previous config saved to /var/cache/conftool/dbconfig/20251022-120256-root.json
[12:03:00] <wikibugs>	 (PS9) Stevemunene: superset: Increase the nginx proxy timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799)
[12:03:08] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.remove-downtime for ssw1-d1-eqiad
[12:03:08] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ssw1-d1-eqiad
[12:05:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84252 and previous config saved to /var/cache/conftool/dbconfig/20251022-120533-root.json
[12:06:25] <wikibugs>	 (CR) Marostegui: [C:+2] db1184: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1198030 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[12:08:49] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[12:08:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1184 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84253 and previous config saved to /var/cache/conftool/dbconfig/20251022-120853-marostegui.json
[12:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:11:04] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[12:11:33] <wikibugs>	 (PS1) Kamila Součková: admin: add lsandergreen to fr-tech-devs, add ssh [puppet] - https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927)
[12:12:00] <wikibugs>	 (PS1) Cathal Mooney: Netops BGP alert: make core bgp group names to be case insensitive [alerts] - https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558)
[12:12:33] <wikibugs>	 (CR) Kamila Součková: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927) (owner: Kamila Součková)
[12:14:40] <wikibugs>	 (PS10) Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469)
[12:14:50] <wikibugs>	 (PS1) Effie Mouzeli: site.pp: bye bye mwdebugXXXX 5 [puppet] - https://gerrit.wikimedia.org/r/1198035
[12:14:56] <wikibugs>	 (PS3) DCausse: cirrus: enable completion search with defaultsort A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858)
[12:15:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:15:48] <wikibugs>	 (PS2) Effie Mouzeli: site.pp: bye bye mwdebugXXXX 5 [puppet] - https://gerrit.wikimedia.org/r/1198035 (https://phabricator.wikimedia.org/T397498)
[12:17:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84254 and previous config saved to /var/cache/conftool/dbconfig/20251022-121707-root.json
[12:18:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 30%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84255 and previous config saved to /var/cache/conftool/dbconfig/20251022-121802-root.json
[12:18:15] <wikibugs>	 (CR) Marostegui: major-upgrade.py: MariaDB major version upgrade cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: Federico Ceratto)
[12:19:02] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:19:07] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:19:15] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84256 and previous config saved to /var/cache/conftool/dbconfig/20251022-122039-root.json
[12:21:02] <wikibugs>	 (CR) CI reject: [V:-1] major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: Federico Ceratto)
[12:25:10] <wikibugs>	 (PS1) Cory Massaro: Wikifunctions: Upgrade orchestrator from 2025-10-14-194525 to 2025-10-22-011302. [deployment-charts] - https://gerrit.wikimedia.org/r/1198036 (https://phabricator.wikimedia.org/T381060)
[12:27:43] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1198037 (https://phabricator.wikimedia.org/T407975)
[12:28:32] <wikibugs>	 (PS3) Michael Große: beta: Enable ReviseTone Structured Task on enwiki,frwiki,arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176)
[12:30:08] <Reedy>	 jouncebot: nownandnext
[12:31:10] <wikibugs>	 (PS6) Clément Goubert: rest-gateway: Deploy rate limiting in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490)
[12:31:34] <wikibugs>	 (CR) Clément Goubert: rest-gateway: Deploy rate limiting in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: Clément Goubert)
[12:32:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84257 and previous config saved to /var/cache/conftool/dbconfig/20251022-123213-root.json
[12:32:21] <wikibugs>	 (CR) Clément Goubert: rest-gateway: Deploy rate limiting in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: Clément Goubert)
[12:32:21] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:32:26] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:32:29] <wikibugs>	 (PS1) Cory Massaro: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846. [deployment-charts] - https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060)
[12:32:53] <wikibugs>	 (PS11) Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469)
[12:33:07] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11298079 (seanleong-WMDE) NDA signed on my end. Thanks!
[12:33:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84258 and previous config saved to /var/cache/conftool/dbconfig/20251022-123308-root.json
[12:33:09] <Lucas_WMDE>	 jouncebot: now
[12:33:10] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 26 minute(s)
[12:33:21] <Lucas_WMDE>	 o_O why didn’t it respond to reedy? or am I not seeing it
[12:33:31] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298080 (MatthewVernon) (to answer the question - like all ms-* nodes, this will continue to be Debian 11 for now, although we might use it for a test...
[12:33:38] <wikibugs>	 (PS2) Cory Massaro: Wikifunctions: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846. [deployment-charts] - https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060)
[12:34:40] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge: Move toolviews processing to HAProxy [puppet] - https://gerrit.wikimedia.org/r/1197308 (https://phabricator.wikimedia.org/T284558) (owner: Majavah)
[12:36:23] <taavi>	 Lucas_WMDE: there's a typo in R.eedy's command
[12:37:02] <logmsgbot>	 cmooney@cumin1003 provision (PID 2610964) is awaiting input
[12:37:16] <wikibugs>	 (PS1) Giuseppe Lavagetto: New logo; rate-limit by wmfuniq [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1198041
[12:37:34] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:37:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] New logo; rate-limit by wmfuniq [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1198041 (owner: Giuseppe Lavagetto)
[12:38:21] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:38:37] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:38:38] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Rate-limit by wmfuniq - oblivian@cumin1003"
[12:38:39] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298106 (elukey) @Papaul @Jhancock.wm I went into System Setup (F2) -> Device -> Raid controller and used the erase function on both 480GB SSDs, clear...
[12:38:40] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit by wmfuniq - oblivian@cumin1003
[12:39:14] <wikibugs>	 (CR) Matthias Mullie: [C:+1] Deploy the ReaderExperiments extension to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: Marco Fossati)
[12:39:26] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit by wmfuniq - oblivian@cumin1003
[12:39:27] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Rate-limit by wmfuniq - oblivian@cumin1003"
[12:39:59] <wikibugs>	 (PS1) Reedy: CommonSettings.php: Set $wgOATHRecoveryCodesCount = 10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1198042 (https://phabricator.wikimedia.org/T407167)
[12:40:02] <Lucas_WMDE>	 taavi: ah ^^
[12:40:19] <Lucas_WMDE>	 and I guess jouncebot doesn’t reply “I don’t understand” like some other bots do (stashbot?)
[12:40:30] <Lucas_WMDE>	 (nope wasn’t stashbot apparently ^^)
[12:40:30] <Reedy>	 it should do string distance and work out if it's close enough to a command it knows
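Reedy's suggestion (compare the input against known commands by string distance and accept a near miss) could look roughly like the sketch below. This is only an illustration, not jouncebot's actual code: the command list is assumed (only "now" appears in this log, and "nowandnext" is inferred from the typo), the function names are hypothetical, and the 0.8 similarity cutoff is an arbitrary choice. It uses `difflib.get_close_matches` from the Python standard library, which ranks candidates by a similarity ratio rather than a true edit distance.

```python
import difflib

# Hypothetical command list, assumed for illustration only.
KNOWN_COMMANDS = ["now", "next", "nowandnext"]


def suggest_command(text, commands=KNOWN_COMMANDS, cutoff=0.8):
    """Return the known command most similar to `text`, or None if none is close enough."""
    word = text.strip().lower()
    # n=1 keeps only the best candidate; cutoff is the minimum similarity ratio (0..1).
    matches = difflib.get_close_matches(word, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def reply_for(text):
    """Build a reply: run exact matches, offer a correction for near misses."""
    command = suggest_command(text)
    if command is None:
        return "I don't understand"
    if command == text.strip().lower():
        return f"running {command}"
    return f"did you mean {command!r}?"
```

With these assumptions, `reply_for("nownandnext")` offers `nowandnext` as a correction instead of staying silent, and unrelated input falls through to an explicit "I don't understand".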
[12:40:49] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:40:50] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:41:06] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw and A:cp
[12:41:20] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw and A:cp
[12:42:31] <wikibugs>	 (CR) Reedy: [C:+2] CommonSettings.php: Set $wgOATHRecoveryCodesCount = 10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1198042 (https://phabricator.wikimedia.org/T407167) (owner: Reedy)
[12:43:22] <wikibugs>	 (Merged) jenkins-bot: CommonSettings.php: Set $wgOATHRecoveryCodesCount = 10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1198042 (https://phabricator.wikimedia.org/T407167) (owner: Reedy)
[12:43:53] <logmsgbot>	 cmooney@cumin1003 provision (PID 2617100) is awaiting input
[12:44:38] <wikibugs>	 (PS1) Majavah: toolforge: toolviews: Drop nginx support [puppet] - https://gerrit.wikimedia.org/r/1198045 (https://phabricator.wikimedia.org/T284558)
[12:45:42] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:45:48] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:47:00] <wikibugs>	 (CR) Phuedx: Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: Awight)
[12:47:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84259 and previous config saved to /var/cache/conftool/dbconfig/20251022-124720-root.json
[12:48:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 60%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84260 and previous config saved to /var/cache/conftool/dbconfig/20251022-124814-root.json
[12:48:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:48:48] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:50:56] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2028.codfw.wmnet
[12:53:35] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2027.codfw.wmnet
[12:53:49] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[12:54:02] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/CommonSettings.php: T407167 (duration: 08m 29s)
[12:54:07] <stashbot>	 T407167: Only One Recovery codes given - https://phabricator.wikimedia.org/T407167
[12:54:15] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:55:39] <wikibugs>	 (PS1) Majavah: P:toolforge::k8s::haproxy: Drop proxy IP rate limit exemption [puppet] - https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948)
[12:55:41] <wikibugs>	 (PS1) Majavah: P:toolforge: Remove separate proxy role [puppet] - https://gerrit.wikimedia.org/r/1198050 (https://phabricator.wikimedia.org/T283948)
[12:55:43] <wikibugs>	 (PS1) Majavah: P:toolforge: Remove long-obsolete proxylistener systemd unit code [puppet] - https://gerrit.wikimedia.org/r/1198051
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1300).
[13:00:05] <jouncebot>	 seanleong-wmde and mfossati: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] <Lucas_WMDE>	 o/
[13:00:16] <Lucas_WMDE>	 I can probably deploy in a few minutes but want to finish a code review first
[13:00:40] <wikibugs>	 (CR) Slyngshede: [C:+1] admin: add lsandergreen to fr-tech-devs, add ssh [puppet] - https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927) (owner: Kamila Součková)
[13:00:45] <mfossati>	 hi there!
[13:01:06] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] Netops BGP alert: make core bgp group names to be case insensitive [alerts] - https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[13:01:25] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:01:26] <mfossati>	 I can self-deploy
[13:01:44] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Netops BGP alert: make core bgp group names to be case insensitive [alerts] - https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[13:02:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84262 and previous config saved to /var/cache/conftool/dbconfig/20251022-130226-root.json
[13:02:49] <Lucas_WMDE>	 mfossati: go ahead :)
[13:02:59] <mfossati>	 all right
[13:03:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84263 and previous config saved to /var/cache/conftool/dbconfig/20251022-130320-root.json
[13:03:22] <wikibugs>	 (Merged) jenkins-bot: Netops BGP alert: make core bgp group names to be case insensitive [alerts] - https://gerrit.wikimedia.org/r/1198034 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[13:03:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: Marco Fossati)
[13:03:28] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[13:03:39] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:04:16] <wikibugs>	 (Merged) jenkins-bot: Deploy the ReaderExperiments extension to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1198016 (https://phabricator.wikimedia.org/T406907) (owner: Marco Fossati)
[13:04:46] <logmsgbot>	 !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1198016|Deploy the ReaderExperiments extension to English Wikipedia (T406907)]]
[13:04:50] <stashbot>	 T406907: Reader Experiments: Deploy extension to English Wikipedia - https://phabricator.wikimedia.org/T406907
[13:06:45] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:08:52] <logmsgbot>	 elukey@cumin1003 reimage (PID 2640471) is awaiting input
[13:09:15] <logmsgbot>	 !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1198016|Deploy the ReaderExperiments extension to English Wikipedia (T406907)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:09:35] <mfossati>	 Let me check
[13:10:05] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie
[13:10:58] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:11:07] <mfossati>	 it works!
[13:11:11] <logmsgbot>	 !log mfossati@deploy2002 mfossati: Continuing with sync
[13:12:45] <wikibugs>	 (CR) Urbanecm: "question: what happens if we enable Revise Tone _without_ edit check?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: Michael Große)
[13:15:18] <logmsgbot>	 !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198016|Deploy the ReaderExperiments extension to English Wikipedia (T406907)]] (duration: 10m 32s)
[13:15:22] <stashbot>	 T406907: Reader Experiments: Deploy extension to English Wikipedia - https://phabricator.wikimedia.org/T406907
[13:15:31] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11298181 (elukey) Provisioned the host, retried a reimage, but it didn't boot in d-i. I checked on the DHCP server:  ` elukey@install2005:~$ sudo journalctl -u isc-dhcp-server.s...
[13:16:09] <mfossati>	 Lucas_WMDE: all done here :-)
[13:16:12] <Lucas_WMDE>	 \o/
[13:16:27] <Lucas_WMDE>	 I’ll wait for ca. 15 minutes to see if sean shows up, calendar says he might be in a meeting at the moment
[13:18:19] <seanleong-wmde>	 Hi, sorry I am late, is the deployment still ongoing?
[13:18:20] <Lucas_WMDE>	 hi seanleong-wmde!
[13:18:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1263 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84264 and previous config saved to /var/cache/conftool/dbconfig/20251022-131826-root.json
[13:18:31] <Lucas_WMDE>	 yes, we can deploy now
[13:18:49] <seanleong-wmde>	 Okay! I missed yesterday's one as well, sorry :/
[13:19:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: Seanleong-wmde)
[13:19:56] <wikibugs>	 (Merged) jenkins-bot: Set Alias entity usage modifier limit to 10. [mediawiki-config] - https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: Seanleong-wmde)
[13:20:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1180523|Set Alias entity usage modifier limit to 10. (T401288)]]
[13:20:37] <stashbot>	 T401288: Implement a more granular alias usage tracking - https://phabricator.wikimedia.org/T401288
[13:21:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet, repooling both afterwards
[13:21:49] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[13:22:46] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:24:21] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11298199 (elukey) I think I found a possible lead - `HTTPSBootChecksHostname` may be the problem, since we use a bare IP when doing the HTTP boot. I am not able to set "Disable...
[13:24:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11298201 (Marostegui) I've expanded their logical volume to use most of the disk as we normally do ` root@clouddb1022:~# pvs   PV         VG   Fmt  Attr PS...
[13:25:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 seanleong-wmde, lucaswerkmeister-wmde: Backport for [[gerrit:1180523|Set Alias entity usage modifier limit to 10. (T401288)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:25:46] <seanleong-wmde>	 testing now
[13:25:46] <Lucas_WMDE>	 seanleong-wmde: please test :)
[13:25:49] <Lucas_WMDE>	 ok
[13:25:50] <seanleong-wmde>	 okie
[13:26:11] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:26:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet, repooling both afterwards
[13:27:14] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ssw1-d1-eqiad with reason: downtime ssw1-d1-eqiad until we have the monitoring checks fully working for the new platform
[13:27:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11298210 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3ced65be-cbbb-4ba9-91b3-b0f2c626ba79) set by cmo...
[13:28:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1013.eqiad.wmnet, repooling both afterwards
[13:28:43] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298212 (Papaul) @elukey @MatthewVernon thank you that was very helpful information. Now I can answer your question  "In UEFI Boot Mode, fixed media (s...
[13:28:47] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[13:29:14] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11298214 (elukey) @Jhancock.wm Hi! The host seems stuck again after trying `reset /system1/pwrmgtsvc1`, it feels like there is something wrong with the host. What do you think?
[13:29:35] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11298217 (Jclark-ctr) Open→Resolved a:Marostegui→Jclark-ctr
[13:29:36] <wikibugs>	 (CR) Ssingh: dnsrecursor: use config dir instead of standalone file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: CDobbins)
[13:29:47] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832#11298221 (Jclark-ctr)
[13:30:49] <wikibugs>	 (PS1) Marostegui: clouddb102[25]: Add hieradata file [puppet] - https://gerrit.wikimedia.org/r/1198064 (https://phabricator.wikimedia.org/T393733)
[13:31:44] <seanleong-wmde>	 Looks good so far
[13:31:53] <Lucas_WMDE>	 ok
[13:32:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298231 (10elukey) @Papaul this is true, the debian installer is the one that eventually sets the proper boot disk, but in all other models we have a ge...
[13:32:02] <Lucas_WMDE>	 did you find a page with >10 alias usages?
[13:32:09] <Lucas_WMDE>	 (nothing in mwdebug so far)
[13:32:12] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2030.codfw.wmnet
[13:32:21] <wikibugs>	 (03CR) 10FNegri: [C:03+1] clouddb102[25]: Add hieradata file [puppet] - 10https://gerrit.wikimedia.org/r/1198064 (https://phabricator.wikimedia.org/T393733) (owner: 10Marostegui)
[13:32:24] <seanleong-wmde>	 I created a module for it and it's working, but not sure about current pages
[13:32:28] <Lucas_WMDE>	 ah, ok
[13:32:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] clouddb102[25]: Add hieradata file [puppet] - 10https://gerrit.wikimedia.org/r/1198064 (https://phabricator.wikimedia.org/T393733) (owner: 10Marostegui)
[13:33:31] <wikibugs>	 (03PS1) 10CDanis: haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065
[13:33:42] <Lucas_WMDE>	 I’m trying out an SQL query
[13:33:43] <Lucas_WMDE>	 SELECT eu_page_id, eu_entity_id, COUNT(*) FROM wbc_entity_usage WHERE eu_aspect LIKE 'A.%' GROUP BY eu_page_id, eu_entity_id HAVING COUNT(*) > 10 LIMIT 10;
[13:33:47] <Lucas_WMDE>	 not sure if it’ll work
[13:33:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (owner: 10CDanis)
[13:34:04] <seanleong-wmde>	 that's a great idea
[13:34:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1013.eqiad.wmnet, repooling both afterwards
[13:34:23] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[13:34:47] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2029.codfw.wmnet
[13:34:59] <seanleong-wmde>	 I'm running it on hewiki
[13:35:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1014.eqiad.wmnet, repooling both afterwards
[13:35:17] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission es1030 [puppet] - 10https://gerrit.wikimedia.org/r/1198066 (https://phabricator.wikimedia.org/T407953)
[13:35:19] <seanleong-wmde>	 since that has the highest chance of having it, as we already ran the script there to update to alias
[13:35:44] <wikibugs>	 (03PS2) 10CDanis: haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990)
[13:36:03] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1030.eqiad.wmnet
[13:36:05] <Lucas_WMDE>	 Empty set (1 min 59.356 sec)
[13:36:08] <Lucas_WMDE>	 well, so much for that idea
[13:36:10] <wikibugs>	 (03PS3) 10CDanis: haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990)
[13:36:11] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis)
[13:36:16] <Lucas_WMDE>	 ok, maybe it’ll work on hewiki
[13:36:30] <Lucas_WMDE>	 seanleong-wmde: where are you running the query? quarry? stat servers? something else?
[13:36:34] <seanleong-wmde>	 hahaha no results as well
[13:36:35] <seanleong-wmde>	 quarry
[13:36:37] <Lucas_WMDE>	 ok
[13:36:43] <Lucas_WMDE>	 just wanted to make sure you’re not using the production servers :D
[13:36:46] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Decommission es1030 [puppet] - 10https://gerrit.wikimedia.org/r/1198066 (https://phabricator.wikimedia.org/T407953)
[13:36:52] <Lucas_WMDE>	 then I guess we can’t really verify that it makes a difference or not
[13:36:54] <Lucas_WMDE>	 on existing pages at least
[13:37:00] <Lucas_WMDE>	 let’s trust your test module
[13:37:08] <seanleong-wmde>	 yeah, should I keep trying to find one after the sync?
[13:37:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 seanleong-wmde, lucaswerkmeister-wmde: Continuing with sync
[13:37:14] <Lucas_WMDE>	 eh, no need imho
[13:37:22] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Decommission es1030 [puppet] - 10https://gerrit.wikimedia.org/r/1198066 (https://phabricator.wikimedia.org/T407953) (owner: 10Marostegui)
[13:37:31] <seanleong-wmde>	 okay
[13:39:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1014.eqiad.wmnet, repooling both afterwards
[13:40:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1015.eqiad.wmnet, repooling both afterwards
[13:40:03] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[13:41:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1180523|Set Alias entity usage modifier limit to 10. (T401288)]] (duration: 20m 47s)
[13:41:24] <stashbot>	 T401288: Implement a more granular alias usage tracking - https://phabricator.wikimedia.org/T401288
[13:41:49] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[13:42:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2249:9290 - https://phabricator.wikimedia.org/T407879#11298266 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[13:43:23] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1015.eqiad.wmnet, repooling both afterwards
[13:44:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling both afterwards
[13:45:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1030.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[13:45:22] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1030.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[13:45:22] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:45:23] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1030.eqiad.wmnet
[13:45:23] <seanleong-wmde>	 thanks! Lucas_WMDE
[13:45:31] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298285 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1003 for hosts: `es1030.eqiad.wmnet` - es1030.eqiad.wmnet (**...
[13:45:40] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298286 (10Marostegui) a:05Marostegui→03None This is ready for #dc-ops
[13:46:04] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298291 (10Marostegui)
[13:46:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11298295 (10Jhancock.wm) 05Open→03Resolved
[13:47:02] <wikibugs>	 (03CR) 10Joely Rooke WMDE: "Hi there! I am wondering if it's possible to keep this stream open for future tracking usages?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198029 (https://phabricator.wikimedia.org/T370045) (owner: 10Phuedx)
[13:47:36] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1198045 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah)
[13:49:03] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[13:49:39] <icinga-wm>	 PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:50:02] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis)
[13:50:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling both afterwards
[13:50:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling both afterwards
[13:50:51] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[13:51:30] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es1030.eqiad.wmnet - https://phabricator.wikimedia.org/T407953#11298310 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[13:55:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling both afterwards
[13:55:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1018.eqiad.wmnet, repooling both afterwards
[13:56:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:57:21] <seanleong-wmde>	 Lucas_WMDE https://en.wikipedia.org/w/index.php?title=Template:Sandbox/Seanleong8/Blank&action=info the changes are showing in live wiki now, thanks!
[13:58:51] <Lucas_WMDE>	 \o/
[13:59:46] <seanleong-wmde>	 o7
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1400)
[14:00:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1018.eqiad.wmnet, repooling both afterwards
[14:00:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1019.eqiad.wmnet, repooling both afterwards
[14:00:29] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[14:01:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:02:45] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb::research: Add mysql user [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:02:53] <wikibugs>	 (03CR) 10Michael Große: "The Check would not show." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: 10Michael Große)
[14:03:07] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb::research: Add mysql user [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:03:50] <wikibugs>	 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11298377 (10Andrew) In any case, it's clear that preseed-test isn't going to help with the actual issue on 2010-dev :/
[14:03:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb::research: Add mysql user [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: 10Ladsgroup)
[14:05:11] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1019.eqiad.wmnet, repooling both afterwards
[14:05:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet, repooling both afterwards
[14:05:19] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] Wikifunctions: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060) (owner: 10Cory Massaro)
[14:05:24] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] superset: Increase the nginx proxy timeout (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene)
[14:06:58] <wikibugs>	 (03Merged) 10jenkins-bot: Wikifunctions: Update function-evaluators from 2025-10-15-120631 to 2025-10-21-143846. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198039 (https://phabricator.wikimedia.org/T381060) (owner: 10Cory Massaro)
[14:07:16] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:07:39] <wikibugs>	 (03Merged) 10jenkins-bot: superset: Increase the nginx proxy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197979 (https://phabricator.wikimedia.org/T407799) (owner: 10Stevemunene)
[14:07:58] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:08:44] <wikibugs>	 (03PS3) 10Ladsgroup: mariadb::research: Add mysql user [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:09:40] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:09:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet, repooling both afterwards
[14:09:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1021.eqiad.wmnet, repooling both afterwards
[14:09:52] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[14:10:26] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:10:36] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:11:07] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: 10Ladsgroup)
[14:11:22] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:11:30] <wikibugs>	 (03CR) 10Majavah: [C:03+2] toolforge: toolviews: Drop nginx support [puppet] - 10https://gerrit.wikimedia.org/r/1198045 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah)
[14:11:36] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2032.codfw.wmnet
[14:12:00] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] Wikifunctions: Upgrade orchestrator from 2025-10-14-194525 to 2025-10-22-011302. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198036 (https://phabricator.wikimedia.org/T381060) (owner: 10Cory Massaro)
[14:13:36] <wikibugs>	 (03Merged) 10jenkins-bot: Wikifunctions: Upgrade orchestrator from 2025-10-14-194525 to 2025-10-22-011302. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198036 (https://phabricator.wikimedia.org/T381060) (owner: 10Cory Massaro)
[14:14:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1021.eqiad.wmnet, repooling both afterwards
[14:14:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling both afterwards
[14:14:48] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2031.codfw.wmnet
[14:15:33] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:16:00] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:16:15] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:16:43] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:16:50] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:17:15] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:18:11] <wikibugs>	 07sre-alert-triage, 06SRE Observability (FY2025/2026-Q2): Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) - https://phabricator.wikimedia.org/T407484#11298507 (10hnowlan)
[14:19:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:19:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling both afterwards
[14:19:25] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[14:20:12] <logmsgbot>	 !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[14:20:44] <wikibugs>	 (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Drop proxy IP rate limit exemption [puppet] - 10https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[14:21:39] <logmsgbot>	 !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[14:21:39] <wikibugs>	 (03PS1) 10Filippo Giunchedi: installserver: revert 'cloudcontrol2010-dev' to standard recipes [puppet] - 10https://gerrit.wikimedia.org/r/1198082 (https://phabricator.wikimedia.org/T407586)
[14:22:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] installserver: revert 'cloudcontrol2010-dev' to standard recipes [puppet] - 10https://gerrit.wikimedia.org/r/1198082 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi)
[14:22:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] installserver: revert 'cloudcontrol2010-dev' to standard recipes [puppet] - 10https://gerrit.wikimedia.org/r/1198082 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi)
[14:24:13] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Drop proxy IP rate limit exemption [puppet] - 10https://gerrit.wikimedia.org/r/1198049 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[14:25:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11298544 (10Jhancock.wm) @elukey found the server up. maybe it takes 5 million years to boot? i remember some of the ms-be supermicro servers had the same issue before with a slig...
[14:26:39] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Left a comment, if it is not a concern go ahead!" [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[14:27:15] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "It is fine to have this snippet in two places, but long term we may want to have it somewhere reusable/more-DRY :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198026 (owner: 10Cathal Mooney)
[14:28:38] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1400)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1430)
[14:30:21] <wikibugs>	 (03PS12) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333)
[14:30:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins)
[14:30:54] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert)
[14:31:28] <wikibugs>	 (03CR) 10Elukey: "Left some comments!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi)
[14:34:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:29] <wikibugs>	 (03PS13) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333)
[14:36:38] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins)
[14:39:24] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] dnsrecursor: use config dir instead of standalone file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins)
[14:39:39] <wikibugs>	 (03CR) 10CDanis: [C:03+2] haproxy: ja4h: all magru [puppet] - 10https://gerrit.wikimedia.org/r/1198065 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis)
[14:42:49] <wikibugs>	 (03CR) 10Tiziano Fogli: "We need to remove this declaration to avoid a duplicate resource declaration when including the pilot instance (tested on Pontoon)." [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron)
[14:43:53] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11298596 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11298598 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11298600 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11298614 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11298616 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11298619 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11298620 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:44:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11298621 (10RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:48:28] <wikibugs>	 (03PS4) 10Ladsgroup: mariadb::research: Add mysql user [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089)
[14:48:33] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mariadb::research: Add mysql user [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: 10Ladsgroup)
[14:48:35] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb::research: Add mysql user [puppet] - 10https://gerrit.wikimedia.org/r/1198077 (https://phabricator.wikimedia.org/T389089) (owner: 10Ladsgroup)
[14:50:57] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2034.codfw.wmnet
[14:55:13] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2033.codfw.wmnet
[14:57:20] <wikibugs>	 (03PS3) 10Dr0ptp4kt: profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey)
[14:58:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11298681 (10calbon) I approve this request
[14:58:41] <logmsgbot>	 !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[14:59:10] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C:03+1] "I added `prometheus=\"k8s\"` to the definitions. Otherwise LGTM. +1'ing, for your next move @ltoscano@wikimedia.org . Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey)
[15:01:54] <wikibugs>	 06SRE, 10Hiddenparma, 06Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298704 (10dancy)
[15:02:16] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991 (10Jhancock.wm) 03NEW
[15:02:58] <wikibugs>	 06SRE, 10Hiddenparma, 06Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298723 (10dancy) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on `deployment-cache-upload08.deployment-prep`.  Please help!
[15:06:07] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb::research: Add ferm hole for mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089)
[15:07:19] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb::research: Add ferm hole for mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089)
[15:08:25] <wikibugs>	 06SRE, 10Hiddenparma, 06Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298750 (10ssingh) >>! In T404826#11298704, @dancy wrote: > https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on `deployment-cache-upload...
[15:09:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:12] <wikibugs>	 (03PS3) 10Ladsgroup: mariadb::research: Add ferm hole for mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089)
[15:11:41] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089) (owner: 10Ladsgroup)
[15:13:32] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks!  Yeah I _thought_ the provision cookbook calls the sre.network.configure-switch-interfaces cookbook, but it seems it runs the func" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198026 (owner: 10Cathal Mooney)
[15:13:36] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] sre.hosts.provision: adjust to always use Homer to config Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198026 (owner: 10Cathal Mooney)
[15:16:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] mariadb::research: Add ferm hole for mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089) (owner: 10Ladsgroup)
[15:18:21] <wikibugs>	 (03PS1) 10Cathal Mooney: team-netops: add checks against Nokia OSPF status [alerts] - 10https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558)
[15:18:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11298804 (10Papaul) @elukey can you please provide me with one of the nodes that is working like you said so I can check what is different from this no...
[15:20:09] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.provision: adjust to always use Homer to config Nokia [cookbooks] - 10https://gerrit.wikimedia.org/r/1198026 (owner: 10Cathal Mooney)
[15:21:32] <wikibugs>	 (03PS2) 10Cathal Mooney: team-netops: add checks against Nokia OSPF status [alerts] - 10https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558)
[15:23:17] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298825 (10Marostegui)
[15:23:56] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298826 (10Marostegui) The patch was done before this task got created, but linking it here for clarity https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197750
[15:24:12] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298828 (10Marostegui)
[15:24:15] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:49] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298834 (10Jhancock.wm) yes forgot to mention that while making this one. thank you so much for getting it done early!
[15:25:06] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10procurement: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11298835 (10Jhancock.wm) a:05Jhancock.wm→03None
[15:30:13] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2036.codfw.wmnet
[15:31:34] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586)
[15:32:59] <wikibugs>	 (03CR) 10Cathal Mooney: Nokia BGP: add function to get policy names based on BGP group name (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[15:34:10] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mariadb::research: Add ferm hole for mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/1198094 (https://phabricator.wikimedia.org/T389089) (owner: 10Ladsgroup)
[15:34:13] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:21] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[15:34:25] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2035.codfw.wmnet
[15:36:04] <wikibugs>	 (03Merged) 10jenkins-bot: Nokia BGP: add function to get policy names based on BGP group name [homer/public] - 10https://gerrit.wikimedia.org/r/1198001 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[15:39:27] <wikibugs>	 06SRE, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994 (10amastilovic) 03NEW
[15:42:50] <wikibugs>	 (03PS1) 10Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104
[15:43:38] <wikibugs>	 (03PS9) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943)
[15:44:33] <wikibugs>	 (03CR) 10Dreamy Jazz: hCaptcha: Enable hCaptcha for form edits on test2wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[15:45:12] <wikibugs>	 (03CR) 10Dreamy Jazz: hCaptcha: Enable hCaptcha for form edits on test2wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[15:45:39] <wikibugs>	 (03CR) 10Dreamy Jazz: hCaptcha: Enable hCaptcha for form edits on test2wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[15:45:56] <wikibugs>	 (03PS2) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727)
[15:48:24] <wikibugs>	 (03PS2) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586)
[15:48:57] <wikibugs>	 (03CR) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[15:49:15] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:50:00] <wikibugs>	 (03CR) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[15:50:05] <wikibugs>	 (03CR) 10Kosta Harlan: [C:04-1] hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[15:50:27] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] admin: add lsandergreen to fr-tech-devs, add ssh [puppet] - 10https://gerrit.wikimedia.org/r/1198033 (https://phabricator.wikimedia.org/T406927) (owner: 10Kamila Součková)
[15:52:59] <wikibugs>	 (03CR) 10MSantos: [C:03+1] fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle)
[15:58:06] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[15:59:47] <wikibugs>	 (03PS2) 10Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104
[16:00:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299061 (10Raine) a:03Raine
[16:03:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11299082 (10elukey) @Papaul this is the first dell config j that we flip to UEFI :)
[16:04:03] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul2002.codfw.wmnet with reason: still in setup
[16:05:15] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: Remove separate proxy role [puppet] - 10https://gerrit.wikimedia.org/r/1198050 (https://phabricator.wikimedia.org/T283948)
[16:05:15] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: Remove long-obsolete proxylistener systemd unit code [puppet] - 10https://gerrit.wikimedia.org/r/1198051
[16:05:15] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::prometheus: Drop separate front proxy scrape target [puppet] - 10https://gerrit.wikimedia.org/r/1198105 (https://phabricator.wikimedia.org/T283948)
[16:05:28] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:05:34] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:06:00] <logmsgbot>	 !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bswiki --logwiki=metawiki Horvathbence200603 HorvBence  # T407995
[16:06:04] <stashbot>	 T407995: Unblock stuck global rename of HorvBence - https://phabricator.wikimedia.org/T407995
[16:06:12] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Make cloudweb Icinga checks non-critical [puppet] - 10https://gerrit.wikimedia.org/r/1196019 (https://phabricator.wikimedia.org/T407208) (owner: 10Majavah)
[16:08:30] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:08:36] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:11:26] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:11:29] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2038.codfw.wmnet
[16:11:31] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:16:30] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198105 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[16:16:34] <logmsgbot>	 !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[16:16:42] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2037.codfw.wmnet
[16:16:43] <wikibugs>	 (03PS3) 10Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104
[16:16:58] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Drop separate front proxy scrape target [puppet] - 10https://gerrit.wikimedia.org/r/1198105 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[16:19:15] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:36] <wikibugs>	 (03PS1) 10Cathal Mooney: sre.hosts.provision: add code to support Homer/Nokia to Dell section [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342)
[16:23:11] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:23:31] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:27:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.hosts.provision: add code to support Homer/Nokia to Dell section [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: 10Cathal Mooney)
[16:34:14] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11299214 (10Raine) 05In progress→03Resolved a:03Raine Done, @Lars let me know if anything isn't working :-)
[16:37:23] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle)
[16:37:50] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] varnish: Enable enable_m_redir in Beta Cluster for all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197693 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle)
[16:38:57] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2203.codfw.wmnet
[16:38:59] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2203.codfw.wmnet
[16:39:11] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:40:20] <logmsgbot>	 !log kamila@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2203.codfw.wmnet with reason: host unresponsive
[16:43:25] <logmsgbot>	 cmooney@cumin1003 provision (PID 2858243) is awaiting input
[16:44:15] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:44:38] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:51:02] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2040.codfw.wmnet
[16:51:51] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[16:54:15] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:56:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11299270 (10Jclark-ctr) a:05BTullis→03None
[16:56:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11299271 (10Jclark-ctr) a:03Jclark-ctr
[16:56:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11299272 (10Papaul) @elukey i think the next step will be to try to install the OS without setting up the boot disk and let the OS take care of it.  mayb...
[16:57:53] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:58:18] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2039.codfw.wmnet
[16:59:47] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1700)
[17:01:02] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004 (10Raine) 03NEW p:05Triage→03Low
[17:03:44] <wikibugs>	 (03CR) 10Jsn.sherman: [C:03+1] Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[17:09:56] <wikibugs>	 (03CR) 10Dzahn: zookeeper: add support for TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[17:10:02] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.dns.netbox
[17:11:03] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Check for same log_actor between local and global log entry [extensions/CentralAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198119 (https://phabricator.wikimedia.org/T398177)
[17:11:11] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Check for same log_actor between local and global log entry [extensions/CentralAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198120 (https://phabricator.wikimedia.org/T398177)
[17:11:11] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: unmask service & disable backup temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[17:11:24] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CentralAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198119 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[17:11:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CentralAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198120 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[17:12:43] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:12:51] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie
[17:14:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01)
[17:15:58] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299360 (10Raine)
[17:16:40] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299376 (10Raine) @mark can you please approve this from the SRE side? Thanks!
[17:19:31] <wikibugs>	 (03PS1) 10Dzahn: zuul: temporarily make zuul2002 use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1198123
[17:19:56] <wikibugs>	 (03PS2) 10Dzahn: zuul: temporarily make zuul2002 use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1198123
[17:20:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul: temporarily make zuul2002 use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1198123 (owner: 10Dzahn)
[17:24:39] <wikibugs>	 (03PS1) 10Dzahn: zuul::base: pass srange firewall parameter as an array [puppet] - 10https://gerrit.wikimedia.org/r/1198126
[17:24:58] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1197720 (https://phabricator.wikimedia.org/T407917) (owner: 10Dzahn)
[17:25:34] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299432 (10Raine) a:05Raine→03mark
[17:26:36] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11299457 (10Raine) a:03KFrancis
[17:27:39] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1198126/7370/" [puppet] - 10https://gerrit.wikimedia.org/r/1198126 (owner: 10Dzahn)
[17:30:25] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2042.codfw.wmnet
[17:30:26] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw and A:cp
[17:31:15] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie
[17:36:59] <wikibugs>	 (03PS1) 10Dzahn: zuul::main: add firewall src sets CACHES to envoy Hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1198127 (https://phabricator.wikimedia.org/T395938)
[17:37:57] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp2041.codfw.wmnet
[17:37:57] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_codfw and A:cp
[17:41:47] <Dreamy_Jazz>	 I've been finding Gerrit really unreliable throughout today
[17:42:09] <Dreamy_Jazz>	 Like connections being dropped entirely, and then only parts of the page loading
[17:42:14] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299634 (10Raine) confirmed key oob
[17:42:17] <Dreamy_Jazz>	 Sometimes thinking I'm signed out entirely
[17:42:24] <Dreamy_Jazz>	 but the next page load I am signed in
[17:42:44] <Dreamy_Jazz>	 Is this known?
[17:42:52] <wikibugs>	 (03PS3) 10Krinkle: varnish: Remove unreachable optin=beta code [puppet] - 10https://gerrit.wikimedia.org/r/1197730 (https://phabricator.wikimedia.org/T405931)
[17:43:11] <wikibugs>	 (03PS6) 10Krinkle: varnish: Enable enable_m_redir in esams and drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1197694 (https://phabricator.wikimedia.org/T405931)
[17:43:14] <wikibugs>	 (03PS1) 10Dzahn: site: move zuul2002 to insetup role temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1198128
[17:43:16] <wikibugs>	 (03PS10) 10Krinkle: varnish: Enable enable_m_redir everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1197695 (https://phabricator.wikimedia.org/T405931)
[17:43:40] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul::main: add firewall src sets CACHES to envoy Hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1198127 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[17:43:41] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197694 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle)
[17:44:13] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:45:14] <wikibugs>	 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11299639 (10Ladsgroup) 05Open→03Resolved I have fully set up the VM now. Some automation is still needed, for which I will file a ticket later.
[17:47:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11299651 (10Raine)
[17:53:17] <Amir1>	 !log mwscript-k8s --dblist=small --follow -- purgeUserOptions.php --login-age 11 (T406724)
[17:53:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:22] <stashbot>	 T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724
[17:55:57] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] "key verified OOB" [puppet] - 10https://gerrit.wikimedia.org/r/1197720 (https://phabricator.wikimedia.org/T407917) (owner: 10Dzahn)
[18:00:05] <jouncebot>	 dancy and andre: That opportune time for a MediaWiki train - Utc-7+Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T1800).
[18:02:33] <wikibugs>	 (03PS1) 10CDanis: haproxy: x-is-browser: --> Data Lake [puppet] - 10https://gerrit.wikimedia.org/r/1198130
[18:03:08] <dancy>	 o/
[18:04:05] <wikibugs>	 (03PS1) 10Kosta Harlan: Instrument the Suggested investigations feature [extensions/CheckUser] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198131 (https://phabricator.wikimedia.org/T404177)
[18:04:28] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1198128/7372/" [puppet] - 10https://gerrit.wikimedia.org/r/1198128 (owner: 10Dzahn)
[18:05:48] <wikibugs>	 (03PS1) 10Ssingh: varnish: add conditional to varnish::common::vcl for beta [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966)
[18:06:57] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh)
[18:07:14] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198133 (https://phabricator.wikimedia.org/T405680)
[18:07:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198133 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot)
[18:08:08] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198133 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot)
[18:08:49] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul2001.codfw.wmnet with reason: still in setup
[18:09:30] <Amir1>	 !log deleting local user_password on sul wikis (T104500)
[18:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:34] <stashbot>	 T104500: Old versions of sensitive user data (email, password hashes) can remain in database indefinitely due to local and global DB not being kept in sync - https://phabricator.wikimedia.org/T104500
[18:09:55] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: still in setup
[18:11:29] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul2001.codfw.wmnet with reason: still in setup
[18:12:48] <wikibugs>	 (03PS3) 10Jcrespo: [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692)
[18:13:18] <wikibugs>	 (03PS4) 10Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104
[18:14:40] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008 (10Urbanecm) 03NEW
[18:14:44] <wikibugs>	 (03CR) 10Jcrespo: "I am doing a deeper refactor, but I am implementing essentially your solution here:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff)
[18:16:27] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.24  refs T405680
[18:16:32] <stashbot>	 T405680: 1.45.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T405680
[18:17:11] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie
[18:17:28] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[18:18:00] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie
[18:19:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:24:20] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "Original change I81ab37d461e0893d251fb9ad6026472b103b574c" [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh)
[18:26:36] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:28:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[18:29:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards
[18:29:58] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[18:31:21] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Special-pages: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009 (10Josve05a) 03NEW
[18:31:38] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 221.94 ms
[18:34:05] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299832 (10A_smart_kitten)
[18:34:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:34:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards
[18:34:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet, repooling both afterwards
[18:36:19] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299851 (10Josve05a)
[18:37:47] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299860 (10Josve05a)
[18:38:33] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on Commons throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299863 (10Josve05a) >>! In T408010#11299812, @Xaosflux wrote: > {F66781980} >  > Able to...
[18:38:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: replace ssh keys with yubikey-backed key for Daniel Z - https://phabricator.wikimedia.org/T407917#11299865 (10Dzahn) a:03Dzahn
[18:39:28] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist on different wikis throws ‘InvalidArgumentException’ fatal error - https://phabricator.wikimedia.org/T408009#11299869 (10Josve05a)
[18:39:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet, repooling both afterwards
[18:39:32] <stashbot>	 T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920
[18:39:49] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist throws ‘InvalidArgumentException’ fatal error on multiple projects - https://phabricator.wikimedia.org/T408009#11299872 (10Xaosflux)
[18:45:04] <wikibugs>	 (03PS1) 10Bking: WIP: deploy a test OpenSearch cluster in opensearch-ipoid-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753)
[18:48:14] <wikibugs>	 (03CR) 10Ssingh: dnsrecursor: use config dir instead of standalone file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins)
[18:49:36] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Watchlist, 06Moderator-Tools-Team, 07Wikimedia-production-error: Special:Watchlist throws ‘InvalidArgumentException’ fatal error on multiple projects - https://phabricator.wikimedia.org/T408009#11299889 (10Zabe) →14Duplicate dup:03T407996
[18:51:23] <wikibugs>	 (03CR) 10CDanis: [C:03+1] varnish: add conditional to varnish::common::vcl for beta [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh)
[18:53:58] <sukhe>	 !log sudo cumin "A:cp" "disable-puppet 'merging CR 1198132'"
[18:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:30] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:01:28] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] varnish: add conditional to varnish::common::vcl for beta [puppet] - 10https://gerrit.wikimedia.org/r/1198132 (https://phabricator.wikimedia.org/T407966) (owner: 10Ssingh)
[19:02:55] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "nothing is buster except puppetmasters and maps:" [puppet] - 10https://gerrit.wikimedia.org/r/1197334 (owner: 10Dzahn)
[19:03:47] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zookeeper: drop safety check for buster, no more buster [puppet] - 10https://gerrit.wikimedia.org/r/1197334 (owner: 10Dzahn)
[19:06:49] <sukhe>	 !log sudo cumin "A:cp" "run-puppet-agent --enable 'merging CR 1198132'"
[19:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:59] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[19:07:17] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[19:09:49] <wikibugs>	 06SRE, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11299948 (10amastilovic)
[19:10:30] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:11:00] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney)
[19:11:08] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test
[19:11:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle)
[19:24:15] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:24:30] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141
[19:27:34] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[19:27:54] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[19:32:59] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[19:33:08] <wikibugs>	 (03CR) 10Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse)
[19:33:18] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[19:36:13] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141
[19:38:28] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11300023 (10VRiley-WMF) 05Open→03In progress Starting on ms-be1089
[19:38:30] <wikibugs>	 (03PS14) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333)
[19:40:18] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins)
[19:42:04] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] dnsrecursor: use config dir instead of standalone file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins)
[19:44:28] <icinga-wm>	 PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[19:45:59] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141 (owner: 10Ebernhardson)
[19:47:42] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: update docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198141 (owner: 10Ebernhardson)
[19:49:15] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:51:32] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:51:43] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:52:05] <wikibugs>	 (03PS1) 10Kgraessle: Fix InvalidArgumentException in Watchlist [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996)
[19:53:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996) (owner: 10Kgraessle)
[19:54:50] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:55:05] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[19:59:54] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11300144 (10KFrancis) The NDA is complete.  Thanks!
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2000)
[20:00:05] <jouncebot>	 Krinkle and katherine_g: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet - https://phabricator.wikimedia.org/T407605#11300145 (10KFrancis) The NDA is complete.  Thanks!
[20:00:18] <katherine_g>	 hi
[20:00:37] <wikibugs>	 (03CR) 10Brennen Bearnes: [C:03+1] "With https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/84 merged this should be fine." [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn)
[20:00:45] <Krinkle>	 hi
[20:02:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle)
[20:02:23] <Krinkle>	 rolling out mine meanwhile
[20:03:03] <katherine_g>	 sounds good, I'll deploy after you
[20:04:19] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[20:04:29] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:06:51] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11300161 (10VRiley-WMF)  ms-be1089 is completed, moving onto the next server ms-be1090
[20:08:14] <wikibugs>	 (03PS1) 10Dzahn: gerrit: set QoS to log_only [puppet] - 10https://gerrit.wikimedia.org/r/1198148 (https://phabricator.wikimedia.org/T406774)
[20:08:25] <wikibugs>	 (03PS2) 10Cathal Mooney: sre.hosts.provision: move the switch config to parent class and run [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342)
[20:13:13] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: set QoS to log_only [puppet] - 10https://gerrit.wikimedia.org/r/1198148 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn)
[20:13:35] <wikibugs>	 (03Merged) 10jenkins-bot: fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197866 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle)
[20:14:09] <logmsgbot>	 !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1197866|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]]
[20:14:14] <stashbot>	 T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403
[20:15:50] <tgr_>	 Krinkle: FWIW https://phabricator.wikimedia.org/T407403#11300176 (although I agree it's fine to just wait and see if anything breaks)
[20:18:30] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1197866|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:19:15] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:20:12] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie
[20:22:17] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:22:35] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Continuing with sync
[20:22:50] <Krinkle>	 tgr_: thx
[20:23:09] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:23:31] <logmsgbot>	 andrew@cumin2002 reimage (PID 4063959) is awaiting input
[20:25:34] <Krinkle>	 katherine_g: nearly done
[20:26:47] <logmsgbot>	 !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197866|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] (duration: 12m 38s)
[20:26:52] <stashbot>	 T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403
[20:29:05] <wikibugs>	 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11300199 (10Andrew) I don't know what a healthy grub run looks like, but I'm not loving this:   ` Oct 22 19:46:10 grub-installer: info: Running chroot /target grub-install  --force "/d...
[20:29:23] <Krinkle>	 katherine_g: all yours
[20:29:28] <icinga-wm>	 RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[20:29:30] <katherine_g>	 krinkle: thanks
[20:30:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996) (owner: 10Kgraessle)
[20:30:39] <wikibugs>	 (03PS18) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054)
[20:30:59] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[20:31:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:31:20] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
[20:31:40] <jinxer-wm>	 FIRING: DiskSpace: Disk space ml-serve1012:9100:/ 4.769% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:34:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:35:12] <wikibugs>	 (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7396/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron)
[20:35:25] <jinxer-wm>	 RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[20:36:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:36:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:36:40] <jinxer-wm>	 RESOLVED: DiskSpace: Disk space ml-serve1012:9100:/ 4.8% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:39:40] <jinxer-wm>	 FIRING: DiskSpace: Disk space ml-serve1012:9100:/ 4.767% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:41:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:45:04] <wikibugs>	 (03CR) 10Herron: [V:03+1] "Thanks, this turned up in pcc as well and I forgot to upload before tagging you.  Sorry for the false start, my bad!  Sorted out now" [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron)
[20:45:43] <wikibugs>	 (03Merged) 10jenkins-bot: Fix InvalidArgumentException in Watchlist [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198147 (https://phabricator.wikimedia.org/T407996) (owner: 10Kgraessle)
[20:46:15] <logmsgbot>	 !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1198147|Fix InvalidArgumentException in Watchlist (T407996)]]
[20:46:20] <stashbot>	 T407996: InvalidArgumentException: Unknown filter module "latest" - https://phabricator.wikimedia.org/T407996
[20:47:28] <wikibugs>	 (03PS1) 10JHathaway: sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726)
[20:49:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway)
[20:50:36] <logmsgbot>	 !log kgraessle@deploy2002 kgraessle: Backport for [[gerrit:1198147|Fix InvalidArgumentException in Watchlist (T407996)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:52:57] <logmsgbot>	 !log kgraessle@deploy2002 kgraessle: Continuing with sync
[20:53:16] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:54:01] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:54:15] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:54:27] <wikibugs>	 (03PS2) 10JHathaway: sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726)
[20:57:04] <logmsgbot>	 !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198147|Fix InvalidArgumentException in Watchlist (T407996)]] (duration: 10m 49s)
[20:57:09] <stashbot>	 T407996: InvalidArgumentException: Unknown filter module "latest" - https://phabricator.wikimedia.org/T407996
[20:58:30] <Josve05a>	 yay my watchlist now works again, thanks katherine_g :D
[20:59:03] <katherine_g>	 Josve05a: np :)
[20:59:46] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2100)
[21:02:17] <wikibugs>	 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11300347 (10Andrew) Here is the equivalent for bookworm (which works):   ` Oct 22 20:45:11 grub-installer: info: Running chroot /target grub-install  --force "/dev/sdd" Oct 22 20:45:11...
[21:04:01] <jinxer-wm>	 RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:09:30] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
[21:09:50] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[21:13:07] <wikibugs>	 (03PS3) 10JHathaway: sysctls: update sysctls 5min after boot [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726)
[21:13:21] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway)
[21:20:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:25:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:26:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:27:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.638s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:30:25] <wikibugs>	 (03PS1) 10Dzahn: gerrit: add 2 large prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774)
[21:30:38] <wikibugs>	 (03CR) 10Scott French: "Thanks, Effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli)
[21:31:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:31:48] <wikibugs>	 (03PS2) 10Dzahn: gerrit: add 2 large prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774)
[21:33:52] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Effie! I believe you should be good to proceed with this patch, as long as you rebase it onto `production` **first** to decouple i" [puppet] - 10https://gerrit.wikimedia.org/r/1198035 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli)
[21:37:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.3s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:37:47] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: add 2 large prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn)
[21:37:48] <wikibugs>	 (03PS5) 10RLazarus: deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352
[21:37:48] <wikibugs>	 (03PS3) 10RLazarus: deployment_server: Add --priority to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196989 (https://phabricator.wikimedia.org/T406212)
[21:37:49] <wikibugs>	 (03PS3) 10RLazarus: deployment_server: Add --dangerously_fast to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212)
[21:41:02] <wikibugs>	 (03CR) 10Hashar: gerrit: add 2 large prefixes to abusers list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn)
[21:41:20] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] deployment_server: Prefix `helmfile apply` output with "[service env]" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192282 (owner: 10RLazarus)
[21:43:05] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] "Thanks @glavagetto@wikimedia.org for both reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1195352 (owner: 10RLazarus)
[21:43:18] <wikibugs>	 (03PS6) 10RLazarus: deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352
[21:44:15] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:45:51] <wikibugs>	 (03CR) 10Bking: [C:03+1] Update the definition of @dse_kubepods_networks [puppet] - 10https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis)
[21:46:32] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352 (owner: 10RLazarus)
[21:49:47] <wikibugs>	 (03PS1) 10Dzahn: gerrit: adding a network to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198159 (https://phabricator.wikimedia.org/T408023)
[21:53:05] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: adding a network to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1198159 (https://phabricator.wikimedia.org/T408023) (owner: 10Dzahn)
[21:59:55] <wikibugs>	 (03PS1) 10Dzahn: gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023)
[22:00:03] <logmsgbot>	 andrew@cumin2002 reimage (PID 4083792) is awaiting input
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2200)
[22:00:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023) (owner: 10Dzahn)
[22:02:20] <wikibugs>	 (03PS2) 10Dzahn: gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023)
[22:02:31] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: add 2 large prefixes to abusers list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198157 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn)
[22:02:49] <wikibugs>	 (03PS5) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[22:03:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078']
[22:03:36] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: add another IPv6 prefix to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198160 (https://phabricator.wikimedia.org/T408023) (owner: 10Dzahn)
[22:03:38] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078']
[22:05:20] <wikibugs>	 (03PS1) 10Ryan Kemper: (wip) wdqs: detect blazegraph deadlock [alerts] - 10https://gerrit.wikimedia.org/r/1198161 (https://phabricator.wikimedia.org/T389859)
[22:06:42] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2058']
[22:07:25] <wikibugs>	 (03PS1) 10Reedy: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198162 (https://phabricator.wikimedia.org/T405235)
[22:07:30] <Reedy>	 jouncebot: nowandnext
[22:07:30] <jouncebot>	 For the next 0 hour(s) and 52 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251022T2200)
[22:07:30] <jouncebot>	 In 7 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0600)
[22:07:30] <jouncebot>	 In 7 hour(s) and 52 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0600)
[22:07:38] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198162 (https://phabricator.wikimedia.org/T405235) (owner: 10Reedy)
[22:07:43] <wikibugs>	 (03PS1) 10Reedy: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198163 (https://phabricator.wikimedia.org/T405235)
[22:07:51] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198163 (https://phabricator.wikimedia.org/T405235) (owner: 10Reedy)
[22:08:23] <Jdlrobson>	 Reedy let me know when you are done. I have a couple of deployments I need to do
[22:08:27] <Jdlrobson>	 TIA
[22:08:52] <Reedy>	 Jdlrobson: These are just maintenance scripts, so a noop for production
[22:09:23] <Jdlrobson>	 so okay for me to proceed?
[22:09:30] <Jdlrobson>	 or do you want to finish up what you are doing first?
[22:10:01] <Reedy>	 You should be good to continue, those will take a little while to get through CI
[22:10:32] <Jdlrobson>	 ok thanks
[22:11:00] <wikibugs>	 (03PS3) 10Jdlrobson: [labs] Move namespaces to audience definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152)
[22:11:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson)
[22:12:28] <wikibugs>	 (03Merged) 10jenkins-bot: [labs] Move namespaces to audience definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson)
[22:12:48] <wikibugs>	 (03PS2) 10Jdlrobson: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841)
[22:12:57] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1002']
[22:13:53] <logmsgbot>	 !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1002']
[22:14:59] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2058']
[22:15:02] <wikibugs>	 (03Merged) 10jenkins-bot: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198162 (https://phabricator.wikimedia.org/T405235) (owner: 10Reedy)
[22:15:11] <wikibugs>	 (03Merged) 10jenkins-bot: Add maintenance script to migrate recovery tokens to their own device [extensions/OATHAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198163 (https://phabricator.wikimedia.org/T405235) (owner: 10Reedy)
[22:15:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078']
[22:15:18] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003']
[22:15:33] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078']
[22:15:43] <logmsgbot>	 !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003']
[22:17:08] <wikibugs>	 (03PS3) 10Jdlrobson: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841)
[22:17:13] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2058']
[22:17:17] <wikibugs>	 (03CR) 10Jdlrobson: "Reivisi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) (owner: 10Jdlrobson)
[22:17:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078']
[22:17:41] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2058']
[22:19:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) (owner: 10Jdlrobson)
[22:19:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:19:54] <wikibugs>	 (03Merged) 10jenkins-bot: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) (owner: 10Jdlrobson)
[22:19:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11300711 (10thcipriani) As noted in the description, @Urbanecm and I chatted, the rationale for access looks good to me. I approve!
[22:20:26] <logmsgbot>	 !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1194309|Enable QuickSurveys on all wikis (T317841)]]
[22:20:30] <stashbot>	 T317841: Simplify QuickSurveys configuration by enabling everywhere - https://phabricator.wikimedia.org/T317841
[22:23:45] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003']
[22:24:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2078']
[22:24:39] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1194309|Enable QuickSurveys on all wikis (T317841)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:25:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078']
[22:25:18] <Reedy>	 !log T407057 - ran mwscript extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php --wiki=officewiki
[22:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:22] <stashbot>	 T407057: Run MoveRecoveryCodesFromTOTP.php - https://phabricator.wikimedia.org/T407057
[22:25:29] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078']
[22:26:29] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Continuing with sync
[22:29:24] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003']
[22:30:38] <logmsgbot>	 !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194309|Enable QuickSurveys on all wikis (T317841)]] (duration: 10m 12s)
[22:30:43] <stashbot>	 T317841: Simplify QuickSurveys configuration by enabling everywhere - https://phabricator.wikimedia.org/T317841
[22:31:37] <Reedy>	 !log T407057 - ran foreachwikiindblist fishbowl.dblist extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php
[22:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:41] <stashbot>	 T407057: Run MoveRecoveryCodesFromTOTP.php - https://phabricator.wikimedia.org/T407057
[22:32:24] <Reedy>	 !log T407057 - ran foreachwikiindblist private.dblist extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php
[22:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2078']
[22:33:00] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2078']
[22:34:34] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003']
[22:35:01] <Jdlrobson>	 Done!
[22:35:09] <logmsgbot>	 !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003']
[22:36:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via next (k8s) 1.446s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:37:55] <wikibugs>	 (03CR) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[22:41:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via next (k8s) 1.157s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:49:07] <Reedy>	 !log T407057 - ran mwscript extensions/OATHAuth/maintenance/MoveRecoveryCodesFromTOTP.php --wiki=metawiki
[22:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:12] <stashbot>	 T407057: Run MoveRecoveryCodesFromTOTP.php - https://phabricator.wikimedia.org/T407057
[23:16:16] <wikibugs>	 (03PS3) 10Reedy: CommonSettings-labs: Remove OATHAuth config that are the same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195191 (https://phabricator.wikimedia.org/T404807)
[23:16:17] <wikibugs>	 (03PS1) 10Tim Starling: recentchanges: QueryRateEstimator improvements [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198178 (https://phabricator.wikimedia.org/T403798)
[23:16:20] <wikibugs>	 (03CR) 10Reedy: [C:03+2] CommonSettings-labs: Remove OATHAuth config that are the same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195191 (https://phabricator.wikimedia.org/T404807) (owner: 10Reedy)
[23:17:31] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings-labs: Remove OATHAuth config that are the same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195191 (https://phabricator.wikimedia.org/T404807) (owner: 10Reedy)
[23:24:15] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:24:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11300946 (10Papaul) While trying to use the firmware upgrade cookbook with "sudo cookbook sre.hardware.upgrade-firmware ms-be2078 --new" i get the error...
[23:25:09] <wikibugs>	 (03PS1) 10Reedy: CommonSettings: Remove some OATHAuth config overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806)
[23:25:27] <wikibugs>	 (03CR) 10Reedy: [C:04-2] "Needs next week's train to go through" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806) (owner: 10Reedy)
[23:38:16] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198181
[23:38:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198181 (owner: 10TrainBranchBot)
[23:49:15] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:52:21] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198181 (owner: 10TrainBranchBot)