[00:00:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:00:39] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:01:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:01:37] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:05:23] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: sync
[00:08:48] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: sync
[00:09:40] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool pool db1227: Work done
[00:20:40] RESOLVED: ProbeDown: Service aqs1025-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1025-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:28:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:22] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2208: Work done
[00:51:30] !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1025.eqiad.wmnet
[00:51:31] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1025.eqiad.wmnet
[00:57:34] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1227: Work done
[01:09:06] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.24 [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270604 (https://phabricator.wikimedia.org/T420482)
[01:09:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.24 [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270604 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[01:09:20] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270605
[01:09:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270605 (owner: TrainBranchBot)
[01:19:49] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.24 [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270604 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[01:19:56] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270605 (owner: TrainBranchBot)
[01:50:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0200)
[02:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0300)
[03:01:53] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270614 (https://phabricator.wikimedia.org/T420482)
[03:01:56] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270614 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[03:02:53] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270614 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[03:03:16] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.46.0-wmf.24 refs T420482
[03:03:20] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482
[03:39:00] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.46.0-wmf.24 refs T420482 (duration: 35m 44s)
[03:39:04] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482
[03:56:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[03:56:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[03:58:09] !incidents
[03:58:09] 7833 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[03:58:27] Let's see
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0400)
[04:00:37] checking too
[04:02:04] jelto: codfw/ulsfo correct ?
[04:02:37] !log mwpresync@deploy1003 Pruned MediaWiki: 1.46.0-wmf.21 (duration: 02m 34s)
[04:04:03] looking into turnilo
[04:04:26] godog: yes the transport to cr4-ulsfo
[04:04:37] I also look in superset
[04:04:51] (PS1) Pppery: Update translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1270617
[04:04:56] ok
[04:26:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[04:26:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[04:27:37] !incidents
[04:27:37] 7833 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[04:28:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:51:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[04:51:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[04:52:05] !incidents
[04:52:05] 7834 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[04:52:05] 7833 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[05:01:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[05:01:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[05:07:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:09:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:12:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:36] (CR) Marostegui: [C:+1] mariadb: wiki-replicas: remove redundant grants [puppet] - https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806) (owner: FNegri)
[05:20:02] (CR) Marostegui: [C:+1] mariadb: wiki-replicas: add grants for %_maintain [puppet] - https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806) (owner: FNegri)
[05:21:49] (PS1) Marostegui: installserver: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270758 (https://phabricator.wikimedia.org/T423151)
[05:25:30] (CR) Marostegui: [C:+2] installserver: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270758 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[05:27:09] (PS1) Marostegui: db2217: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270760 (https://phabricator.wikimedia.org/T422777)
[05:27:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2217.codfw.wmnet with reason: Reimage to Trixie
[05:27:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2217: Reimage
[05:28:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2217: Reimage
[05:29:57] (CR) Marostegui: [C:+2] db2217: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270760 (https://phabricator.wikimedia.org/T422777) (owner: Marostegui)
[05:30:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2217.codfw.wmnet with OS trixie
[05:35:24] (PS1) Marostegui: db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270761 (https://phabricator.wikimedia.org/T422777)
[05:35:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1180: Upgrade package
[05:35:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1180.eqiad.wmnet with reason: Reimage to Trixie
[05:36:06] (CR) Marostegui: [C:+2] db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270761 (https://phabricator.wikimedia.org/T422777) (owner: Marostegui)
[05:39:04] (PS1) Marostegui: eqiad.yaml: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151)
[05:40:44] (PS1) Daniel Kinzler: API rate limits: add highlimits-user class [mediawiki-config] - https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796)
[05:41:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:41:34] (CR) CI reject: [V:-1] API rate limits: add highlimits-user class [mediawiki-config] - https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796) (owner: Daniel Kinzler)
[05:43:19] (PS2) Daniel Kinzler: API rate limits: add highlimits-user class [mediawiki-config] - https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796)
[05:46:04] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db1180: Upgrade package
[05:47:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1180.eqiad.wmnet with OS trixie
[05:49:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2217.codfw.wmnet with reason: host reimage
[05:52:47] (PS1) Daniel Kinzler: rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - https://gerrit.wikimedia.org/r/1270766
[05:54:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2217.codfw.wmnet with reason: host reimage
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0600)
[06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0600).
[06:00:37] (CR) Muehlenhoff: [C:+2] Temporarily depool puppetserver1002/2002 [dns] - https://gerrit.wikimedia.org/r/1270441 (owner: Muehlenhoff)
[06:00:57] !log jmm@dns1004 START - running authdns-update
[06:02:13] !log jmm@dns1004 END - running authdns-update
[06:02:40] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: host reimage
[06:03:53] (PS1) Daniel Kinzler: rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270767
[06:04:55] (CR) Marostegui: "@taavi@wikimedia.org @fnegri@wikimedia.org is there anything needed after merging this to get the host removed from the LB?" [puppet] - https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[06:08:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: host reimage
[06:09:49] (PS1) Marostegui: Revert "db2217: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270768
[06:11:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2217.codfw.wmnet with OS trixie
[06:11:45] (CR) Marostegui: [C:+2] Revert "db2217: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270768 (owner: Marostegui)
[06:12:13] (PS1) Marostegui: Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270769
[06:13:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2217: repool after reimage to trixie
[06:16:12] (PS5) Ryan Kemper: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:19:14] (PS6) Ryan Kemper: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:20:03] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:20:57] !log jmm@dns1004 START - running authdns-update
[06:22:12] !log jmm@dns1004 END - running authdns-update
[06:25:42] !log jmm@dns1004 START - running authdns-update
[06:27:03] !log jmm@dns1004 END - running authdns-update
[06:30:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[06:30:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1180.eqiad.wmnet with OS trixie
[06:30:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[06:31:11] (CR) Filippo Giunchedi: [C:+1] P:wmcs::striker: Remove separate monitoring profile [puppet] - https://gerrit.wikimedia.org/r/1270282 (owner: Majavah)
[06:33:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2002.codfw.wmnet
[06:36:34] ops-eqiad, DC-Ops: Work on storage room cleanup - https://phabricator.wikimedia.org/T423227 (VRiley-WMF) NEW
[06:38:31] (CR) JMeybohm: [C:+1] "SGTM, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[06:39:27] (CR) JMeybohm: [C:+2] prometheus::k8s: Ingest envoy cluster_update metrics [puppet] - https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: JMeybohm)
[06:40:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2002.codfw.wmnet
[06:40:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1002.eqiad.wmnet
[06:41:43] (CR) Ryan Kemper: [C:+1] "PCC looking good. I fixed a few issues:" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:43:10] (PS3) JMeybohm: kubernetes: Remove docker as supported container runtime [puppet] - https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870)
[06:45:13] SRE, Data-Platform-SRE, Infrastructure-Foundations, Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11817793 (JMeybohm)
[06:45:35] SRE, Data-Platform-SRE, Infrastructure-Foundations, Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11817794 (JMeybohm)
[06:47:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1002.eqiad.wmnet
[06:50:39] (PS1) Muehlenhoff: Revert "Temporarily depool puppetserver1002/2002" [dns] - https://gerrit.wikimedia.org/r/1270770
[06:56:31] (CR) Muehlenhoff: [C:+2] Revert "Temporarily depool puppetserver1002/2002" [dns] - https://gerrit.wikimedia.org/r/1270770 (owner: Muehlenhoff)
[06:56:37] !log jmm@dns1004 START - running authdns-update
[06:57:59] !log jmm@dns1004 END - running authdns-update
[06:58:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2217: repool after reimage to trixie
[07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:04:22] (CR) Muehlenhoff: [C:+2] Add a new Cumin alias to match hosts which are accessible via kerberized SSH [puppet] - https://gerrit.wikimedia.org/r/1270279 (owner: Muehlenhoff)
[07:04:59] (CR) Marostegui: [C:+2] Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270769 (owner: Marostegui)
[07:05:35] (PS1) Mszwarc: Prepare $wgOATH2FARequiredGroupRemovalPages for next groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118)
[07:06:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118) (owner: Mszwarc)
[07:06:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1180: after upgrade
[07:07:12] I see no deploys are going on, I'll proceed with deploying that change ^
[07:08:02] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118) (owner: Mszwarc)
[07:08:32] (PS1) Marostegui: db2180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270773
[07:08:57] (Merged) jenkins-bot: Prepare $wgOATH2FARequiredGroupRemovalPages for next groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118) (owner: Mszwarc)
[07:09:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:09:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:10:01] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1270772|Prepare $wgOATH2FARequiredGroupRemovalPages for next groups (T423118)]]
[07:10:05] T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[07:11:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:11:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:11:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2180: Reimage to Trixie
[07:11:55] (CR) Marostegui: [C:+2] db2180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270773 (owner: Marostegui)
[07:12:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2180: Reimage to Trixie
[07:14:02] (PS2) Brouberol: deployment_server: expand IPs behind the dumps-wikimedia external service [puppet] - https://gerrit.wikimedia.org/r/1270771
[07:14:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2180.codfw.wmnet with OS trixie
[07:15:40] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1270772|Prepare $wgOATH2FARequiredGroupRemovalPages for next groups (T423118)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:15:44] T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[07:16:29] !log mszwarc@deploy1003 mszwarc: Continuing with sync
[07:22:38] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270772|Prepare $wgOATH2FARequiredGroupRemovalPages for next groups (T423118)]] (duration: 12m 36s)
[07:22:41] T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[07:29:24] SRE-SLO, ServiceOps new, Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11817924 (OKryva-WMF)
[07:32:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2180.codfw.wmnet with reason: host reimage
[07:32:19] (CR) Marostegui: [C:+1] mariadb: Migrate section-specific DBA access rule to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: Muehlenhoff)
[07:35:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[07:36:11] SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11817941 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye
[07:36:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2068
[07:36:29] (PS1) Klausman: admin/klausman: remove non-YK SSH key [puppet] - https://gerrit.wikimedia.org/r/1270780
[07:36:40] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[07:38:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: host reimage
[07:40:26] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2068 - mvernon@cumin2002"
[07:40:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2068 - mvernon@cumin2002"
[07:40:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:40:32] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2068.codfw.wmnet 91.32.192.10.in-addr.arpa 1.9.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:40:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2068.codfw.wmnet 91.32.192.10.in-addr.arpa 1.9.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:40:37] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2068
[07:41:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2068
[07:41:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2068
[07:46:59] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
[07:49:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on pc2012.codfw.wmnet with reason: T419961
[07:49:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on pc1012.eqiad.wmnet with reason: T419961
[07:51:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1180: after upgrade
[07:54:20] (PS1) Marostegui: Revert "db2180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270813
[07:54:59] (CR) Marostegui: [C:+2] Revert "db2180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270813 (owner: Marostegui)
[07:58:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[07:59:55] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11818041 (Marostegui)
[08:00:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2180.codfw.wmnet with OS trixie
[08:00:22] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2180: repool after maintenance
[08:01:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc1012: T419961
[08:01:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:02:11] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:02:11] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1012: T419961
[08:02:26] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc1012: T419961
[08:02:26] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:02:34] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:02:34] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1012: T419961
[08:02:44] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc2012: T419961
[08:02:44] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:02:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:02:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2012: T419961
[08:04:20] !log installing libnginx-mod-http-lua security updates
[08:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:34] (PS1) Marostegui: db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270859
[08:05:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[08:05:18] (CR) Marostegui: [C:+2] db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270859 (owner: Marostegui)
[08:05:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1168.eqiad.wmnet with reason: Reimage to Trixie
[08:05:45] (PS1) MVernon: preseed: move ms-be206[8,9] to new-style storage [puppet] - https://gerrit.wikimedia.org/r/1270860 (https://phabricator.wikimedia.org/T354872)
[08:05:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1168: Reimage to Trixie
[08:06:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1168: Reimage to Trixie
[08:07:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1168.eqiad.wmnet with OS trixie
[08:10:34] (CR) Marostegui: [C:+1] preseed: move ms-be206[8,9] to new-style storage [puppet] - https://gerrit.wikimedia.org/r/1270860 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[08:14:01] (CR) MVernon: [C:+2] preseed: move ms-be206[8,9] to new-style storage [puppet] - https://gerrit.wikimedia.org/r/1270860 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[08:17:30] (CR) Ayounsi: [C:-1] kubernetes-generic: Add alerts for BGP failure scenarios. (4 comments) [alerts] - https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: Blake)
[08:20:43] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2012: T419961
[08:20:44] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:20:57] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:20:57] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2012: T419961
[08:22:10] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1168.eqiad.wmnet with reason: host reimage
[08:23:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2068.codfw.wmnet with OS bullseye
[08:23:17] SRE, SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11818181 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet...
[08:25:17] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2068
[08:25:34] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[08:25:43] (CR) Brouberol: [C:+1] "Excellent commit message. thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696) (owner: Ryan Kemper)
[08:28:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:28:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:51] (CR) Brouberol: [C:-1] "The bug was in the original module, and was fixed. The real fix here is to update the module" [deployment-charts] - https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) (owner: Ryan Kemper)
[08:31:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1168.eqiad.wmnet with reason: host reimage
[08:34:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet
[08:34:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet
[08:35:06] (PS3) Arnaudb: gerrit: migrate gerrit_site away from root partition [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027)
[08:35:06] (CR) Arnaudb: "the pcc output is visible here: https://puppet-compiler.wmflabs.org/output/1270774/6384/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[08:35:28] (PS2) Arnaudb: gerrit: update sync-instances cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143)
[08:35:28] (CR) Arnaudb: "related to 1270774" [cookbooks] - https://gerrit.wikimedia.org/r/1270863
(https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [08:37:46] (03CR) 10Blake: [C:03+2] kubernetes: Remove docker as supported container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870) (owner: 10JMeybohm) [08:38:44] (03CR) 10Blake: [C:03+2] kubernetes: Remove docker related hiera settings from nodes [puppet] - 10https://gerrit.wikimedia.org/r/1260742 (https://phabricator.wikimedia.org/T395870) (owner: 10JMeybohm) [08:40:20] (03CR) 10Blake: [C:03+2] admin: add Blake's backup SSH key. [puppet] - 10https://gerrit.wikimedia.org/r/1270436 (owner: 10Blake) [08:40:45] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11818235 (10MLechvien-WMF) p:05Triage→03Low [08:42:50] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175#11818240 (10MLechvien-WMF) p:05Triage→03Low [08:43:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [08:43:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90618 and previous config saved to /var/cache/conftool/dbconfig/20260414-084353-fceratto.json [08:43:57] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:45:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2180: repool after maintenance [08:52:12] (03PS1) 10Marostegui: Revert "db1168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270866 [08:52:51] (03CR) 10Elukey: [C:03+2] profile::pki::intermediates: update debmonitor's public key [puppet] - 10https://gerrit.wikimedia.org/r/1270286 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:59:24] (03CR) 10FNegri: 
[C:03+1] "I think this patch is all we need." [puppet] - 10https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151) (owner: 10Marostegui) [09:01:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90620 and previous config saved to /var/cache/conftool/dbconfig/20260414-090112-fceratto.json [09:01:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:02:15] (03CR) 10Btullis: [C:03+1] "Looks good. One tiny nit in the comment." [puppet] - 10https://gerrit.wikimedia.org/r/1270771 (owner: 10Brouberol) [09:03:20] (03CR) 10Brouberol: deployment_server: expand IPs behind the dumps-wikimedia external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1270771 (owner: 10Brouberol) [09:09:27] (03CR) 10Marostegui: [C:03+2] eqiad.yaml: Remove clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151) (owner: 10Marostegui) [09:09:29] FIRING: ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:09:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:11:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90621 and previous config saved to /var/cache/conftool/dbconfig/20260414-091122-fceratto.json [09:12:38] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 4:00:00 on backup[1006-1007,1014].eqiad.wmnet with reason: maintenance [09:12:44] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11818387 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b67b68ad-79cc-40ba-b2d3-11ce2438694e) set by j... [09:13:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:14:29] FIRING: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:15:04] (03PS11) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [09:15:47] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [09:16:19] (03CR) 10Federico Ceratto: [C:03+2] admin: Add second U2F key, remove non-U2F SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1269970 (owner: 10Federico Ceratto) [09:17:49] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host backup1006 [09:19:24] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:21:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90623 and previous config saved to 
/var/cache/conftool/dbconfig/20260414-092130-fceratto.json [09:22:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:23:33] (03PS12) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [09:24:09] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1006 - ayounsi@cumin1003" [09:24:14] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1006 - ayounsi@cumin1003" [09:24:14] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:24:14] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache backup1006.eqiad.wmnet 162.32.64.10.in-addr.arpa 2.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:24:18] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) backup1006.eqiad.wmnet 162.32.64.10.in-addr.arpa 2.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:24:18] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host backup1006 [09:24:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1006 [09:24:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host backup1006 [09:24:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:24:45] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [09:25:49] (03PS3) 10Brouberol: deployment_server: expand IPs behind the dumps-wikimedia external service [puppet] - 10https://gerrit.wikimedia.org/r/1270771 [09:27:48] !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1151: Security updates [09:27:49] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [09:27:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:27:57] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:27:57] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1151: Security updates [09:28:23] (03CR) 10Brouberol: [C:03+2] deployment_server: expand IPs behind the dumps-wikimedia external service [puppet] - 10https://gerrit.wikimedia.org/r/1270771 (owner: 10Brouberol) [09:29:23] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db2144: Test depool [09:29:23] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [09:29:29] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:29:29] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2144: Test depool [09:30:55] (03PS1) 10Marostegui: check_private_data_report: Remove clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/1270868 (https://phabricator.wikimedia.org/T423151) [09:31:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90625 and previous config saved to /var/cache/conftool/dbconfig/20260414-093138-fceratto.json [09:31:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:31:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc2011: Test depool [09:31:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [09:31:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [09:31:58] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [09:31:58] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2011: Test depool [09:32:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90627 and previous config saved to /var/cache/conftool/dbconfig/20260414-093204-fceratto.json [09:32:15] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: Test depool [09:32:15] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [09:32:19] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Remove clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/1270868 (https://phabricator.wikimedia.org/T423151) (owner: 10Marostegui) [09:32:29] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [09:32:29] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: Test depool [09:34:14] PROBLEM - MariaDB Replica IO: ms2 on db2144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1151.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1151.eqiad.wmnet (111 Connection refused) 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:35:03] (03PS13) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [09:35:18] federico3: ^* [09:35:50] odd, the script should have silenced it [09:37:03] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [09:38:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1151.eqiad.wmnet with reason: T419961 [09:39:29] FIRING: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:11] (03PS1) 10Elukey: debmonitor: use chained TLS cert for server and client [puppet] - 10https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) [09:41:20] PROBLEM - MariaDB Replica Lag: ms2 on db2144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:25] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/output/1270870/8408/" [puppet] - 10https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:43:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1270870 
(https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:44:02] (03CR) 10Volans: [C:03+1] "ship it" [puppet] - 10https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:44:11] (03CR) 10Elukey: [C:03+2] debmonitor: use chained TLS cert for server and client [puppet] - 10https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [09:44:23] (03PS2) 10Federico Ceratto: sre.mysql.pool: Handle private tasks exception [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) [09:45:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2144.codfw.wmnet with reason: T419961 [09:46:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: Test depool [09:46:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [09:46:13] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:46:13] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: Test depool [09:46:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc2011: Test depool [09:46:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [09:46:37] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:46:37] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2011: Test depool [09:46:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: Test depool [09:46:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [09:47:04] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:47:04] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: Test depool [09:47:41] (03CR) 10Federico Ceratto: "I updated it to support parsercache as well and tested that." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto) [09:49:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90631 and previous config saved to /var/cache/conftool/dbconfig/20260414-094926-fceratto.json [09:49:29] RESOLVED: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:49:31] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:50:40] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270766 (owner: 10Daniel Kinzler) [09:50:49] (03CR) 10Marostegui: "So this is all tested and works?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto) [09:52:35] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270767 (owner: 10Daniel Kinzler) [09:56:54] !log rotated debmonitor client and server certs fleetwide for intermediate certs rotation - T420993 [09:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:58] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993 [09:57:59] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270767 (owner: 10Daniel Kinzler) [09:58:15] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270766 (owner: 10Daniel Kinzler) [09:58:23] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: handle percent-escaped pipes in query params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [09:59:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90632 and previous config saved to /var/cache/conftool/dbconfig/20260414-095934-fceratto.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1000) [10:00:39] (03Merged) 10jenkins-bot: rest gateway: handle percent-escaped pipes in query params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [10:00:45] (03Merged) 10jenkins-bot: rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1270766 (owner: 10Daniel Kinzler) [10:00:47] (03Merged) 10jenkins-bot: rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270767 (owner: 10Daniel Kinzler) [10:02:58] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:03:20] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:04:03] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:04:36] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:06:16] (03PS14) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [10:06:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1168.eqiad.wmnet with OS trixie [10:07:19] (03PS1) 10STran: Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) [10:07:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1168: after reimage to trixie [10:08:28] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [10:08:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:09:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:09:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90634 and previous config saved to /var/cache/conftool/dbconfig/20260414-100942-fceratto.json [10:09:46] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[10:10:36] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1168: after reimage to trixie [10:10:39] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [10:11:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Fully repool db1168', diff saved to https://phabricator.wikimedia.org/P90635 and previous config saved to /var/cache/conftool/dbconfig/20260414-101119-marostegui.json [10:12:46] (03CR) 10Marostegui: [C:03+2] Revert "db1168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270866 (owner: 10Marostegui) [10:12:49] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [10:13:36] (03CR) 10CI reject: [V:04-1] Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [10:13:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [10:14:06] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:14:29] !log fceratto@cumin2002 dbctl commit (dc=all): 'Pool in', diff saved to https://phabricator.wikimedia.org/P90636 and previous config saved to /var/cache/conftool/dbconfig/20260414-101428-fceratto.json [10:16:07] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:18:32] (03PS1) 10Elukey: admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) [10:19:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet [10:20:49] (03CR) 10Anzx: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [10:21:00] !log install cumin v6.0.0 on cumin1003 
(last host remained to upgrade) [10:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:17] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:23:48] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:24:27] (03PS2) 10STran: Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) [10:24:52] !log volans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1005.eqiad.wmnet with reason: Testing cumin v6.0.0 [10:25:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet [10:34:28] (03CR) 10Mszwarc: [C:03+1] Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [10:34:37] (03PS1) 10D3r1ck01: Remove temporary `wgOAuth2UsePrefixedSub` feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) [10:36:32] (03PS4) 10Arnaudb: gerrit: migrate gerrit_site away from root partition [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) [10:36:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:37:27] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [10:39:56] (03PS15) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [10:41:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has 
failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:44:26] (03CR) 10Brouberol: [C:03+1] "Diff looks good. I trust you on whether this is enough" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:44:53] (03PS1) 10Muehlenhoff: debdeploy: Bump changelog for new release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1270883 [10:46:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:47:46] (03PS4) 10Bodhisattwa: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 [10:49:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: 10Stang) [10:51:01] (03CR) 10Jelto: [C:03+1] "lgtm, I also like reducing the number of rsync server modules and bacula filesets" [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [10:51:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:53:29] (03PS1) 10STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) [10:53:41] (03PS1) 10STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) [10:54:12] (03CR) 10STran: "We want to deploy to enwiki on Wed so this needs to go in at the same time" [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [10:54:13] (03PS5) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [10:54:22] RECOVERY - MariaDB Replica IO: ms2 on db2144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:54:22] RECOVERY - MariaDB Replica Lag: ms2 on db2144 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:54:40] (03CR) 10CI reject: [V:04-1] Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [10:54:41] (03PS4) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [10:54:52] (03PS16) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [10:54:55] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1151: Security update [10:54:55] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [10:54:57] (03CR) 10JMeybohm: "Wouldn't it be enough to do this on one 
staging cluster rather then all of them?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:55:04] (03CR) 10CI reject: [V:04-1] Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [10:55:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [10:55:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1151: Security update [10:55:13] (03CR) 10JMeybohm: [C:04-1] admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:56:22] !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1153: Security updates [10:56:22] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [10:56:24] (03CR) 10Clément Goubert: "@tklausmann@wikimedia.org @kbazira@wikimedia.org I'm tagging you on this as owners of the backend services, could you check that the URL p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [10:56:30] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [10:56:30] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1153: Security updates [10:56:51] (03CR) 10Clément Goubert: "@tklausmann@wikimedia.org @kbazira@wikimedia.org I'm tagging you on this as owners of the backend services, could you check that the URL p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [10:57:08] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 
(https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [10:57:41] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [10:57:44] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM, no uses since wmf.22:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: 10Daimona Eaytoy) [10:58:56] (03CR) 10Anzx: [C:03+1] Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [10:59:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:59:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [10:59:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90639 and previous config saved to /var/cache/conftool/dbconfig/20260414-105920-fceratto.json [10:59:24] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:00:05] (03PS17) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [11:00:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye [11:02:08] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:02:21] (03CR) 10STran: "recheck" [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [11:04:28] PROBLEM - MariaDB Replica IO: ms3 on db2143 is CRITICAL: CRITICAL slave_io_state 
Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1153.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1153.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:04:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90640 and previous config saved to /var/cache/conftool/dbconfig/20260414-110432-fceratto.json [11:04:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:07:28] RECOVERY - MariaDB Replica IO: ms3 on db2143 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:12:48] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:13:42] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:14:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90641 and previous config saved to /var/cache/conftool/dbconfig/20260414-111440-fceratto.json [11:15:02] PROBLEM - MariaDB Replica IO: ms3 on db1153 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2143.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2143.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:16:18] (03PS1) 10FNegri: mariadb: wiki-replicas: add missing grants [puppet] - 
10https://gerrit.wikimedia.org/r/1270891 (https://phabricator.wikimedia.org/T422806) [11:16:48] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270892 [11:19:02] RECOVERY - MariaDB Replica IO: ms3 on db1153 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:21:10] PROBLEM - MariaDB Replica Lag: s8 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 553.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:22:39] (03PS2) 10FNegri: mariadb: wiki-replicas: add missing grants [puppet] - 10https://gerrit.wikimedia.org/r/1270891 (https://phabricator.wikimedia.org/T422806) [11:24:05] !log root@cumin1003 START - Cookbook sre.mysql.pool pool db1153: Security updates [11:24:05] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [11:24:18] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:24:18] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1153: Security updates [11:24:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90643 and previous config saved to /var/cache/conftool/dbconfig/20260414-112448-fceratto.json [11:26:14] mvernon@cumin2002 convert-disks (PID 3035179) is awaiting input [11:26:27] (03PS18) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [11:33:55] (03PS1) 10Matthias Mullie: Enable Extension:ReaderExperiments on itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270897 (https://phabricator.wikimedia.org/T423173) [11:34:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90644 and previous config saved to 
/var/cache/conftool/dbconfig/20260414-113456-fceratto.json [11:35:01] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:35:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [11:35:07] (03Abandoned) 10Matthias Mullie: Enable Extension:ReaderExperiments on itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270897 (https://phabricator.wikimedia.org/T423173) (owner: 10Matthias Mullie) [11:35:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90645 and previous config saved to /var/cache/conftool/dbconfig/20260414-113510-fceratto.json [11:37:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90646 and previous config saved to /var/cache/conftool/dbconfig/20260414-113721-fceratto.json [11:37:37] (03PS1) 10MVernon: partman: also add ms-be206[8-9] to partman_early_command [puppet] - 10https://gerrit.wikimedia.org/r/1270899 (https://phabricator.wikimedia.org/T354872) [11:39:35] (03PS1) 10Daniel Kinzler: redioscope: capture rate limit window duration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270901 [11:41:31] (03PS1) 10JMeybohm: kubernetes: Remove absent rsyslog config: block-docker-mount-spam [puppet] - 10https://gerrit.wikimedia.org/r/1270903 (https://phabricator.wikimedia.org/T395870) [11:42:21] (03CR) 10Slyngshede: P:tofurkey Add tofurkey (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:45:59] !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool esams [reason: network maintenance, T416450] [11:46:03] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [11:46:22] 
(03CR) 10Klausman: [C:03+1] "Added a few minor thoughts, nothing that really needs addressing right now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [11:46:33] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool esams [reason: network maintenance, T416450] [11:47:01] !log ayounsi@cumin1003 conftool action : set/pooled=no; selector: cluster=dnsbox,dc=esams [reason: esams maintenance over] [11:47:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90647 and previous config saved to /var/cache/conftool/dbconfig/20260414-114732-fceratto.json [11:47:45] jouncebot: nowandnext [11:47:46] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [11:47:46] In 0 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1200) [11:48:24] (03PS1) 10STran: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270902 (https://phabricator.wikimedia.org/T423216) [11:49:54] (03PS1) 10STran: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) [11:52:08] (03CR) 10Clément Goubert: [C:03+1] kubernetes: Remove absent rsyslog config: block-docker-mount-spam [puppet] - 10https://gerrit.wikimedia.org/r/1270903 (https://phabricator.wikimedia.org/T395870) (owner: 10JMeybohm) [11:53:05] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270448 (owner: 10Muehlenhoff) [11:54:39] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [11:55:30] (03CR) 10Klausman: amg-gpu: Set up explicit GPU partitioning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1269344 
(https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [11:55:36] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [11:56:40] (03CR) 10Federico Ceratto: [C:03+1] "The glob matches the description." [puppet] - 10https://gerrit.wikimedia.org/r/1270899 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [11:57:25] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade [11:57:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90648 and previous config saved to /var/cache/conftool/dbconfig/20260414-115739-fceratto.json [11:57:45] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [11:57:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T410589)', diff saved to https://phabricator.wikimedia.org/P90649 and previous config saved to /var/cache/conftool/dbconfig/20260414-115752-ladsgroup.json [11:57:59] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [11:58:07] (03CR) 10Vgutierrez: [C:04-1] "nice work, see inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:58:26] (03CR) 10MVernon: [C:03+2] partman: also add ms-be206[8-9] to partman_early_command [puppet] - 10https://gerrit.wikimedia.org/r/1270899 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1200) [12:01:04] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [12:01:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Degraded RAID on ml-serve1001 - 
https://phabricator.wikimedia.org/T422382#11818956 (10klausman) I think we can run this machine one a single disk until its replacement arrives. Even if it dies entirely, we have enough serving capacity in eqiad to handl... [12:01:52] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [12:02:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T410589)', diff saved to https://phabricator.wikimedia.org/P90650 and previous config saved to /var/cache/conftool/dbconfig/20260414-120200-ladsgroup.json [12:02:16] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [12:02:17] !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1151: Security updates [12:02:18] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [12:02:26] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [12:02:26] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1151: Security updates [12:03:03] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [12:05:48] (03Abandoned) 10Tchanders: Add Special:GlobalContributions to no-IP reveal pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: 10Tchanders) [12:05:59] (03CR) 10Klausman: [C:03+1] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [12:06:18] (03PS19) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:07:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90652 and previous config saved to 
/var/cache/conftool/dbconfig/20260414-120747-fceratto.json [12:07:49] (03CR) 10Klausman: [C:03+1] "LGTM, with one note: The API GW/Envoy does not do path normalization, hence the `(:|%3A|%3a)` regex. _If_ normaliztion is done in the new " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [12:07:52] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:08:05] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:08:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T419635)', diff saved to https://phabricator.wikimedia.org/P90653 and previous config saved to /var/cache/conftool/dbconfig/20260414-120812-fceratto.json [12:08:58] PROBLEM - MariaDB Replica IO: ms2 on db2144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1151.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1151.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:11:58] RECOVERY - MariaDB Replica IO: ms2 on db2144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:12:56] PROBLEM - Host ganeti6004 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:56] PROBLEM - Host durum6001 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:56] PROBLEM - Host ganeti6002 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:56] PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:08] RECOVERY - Host ganeti6002 is UP: PING OK - Packet loss = 0%, RTA = 87.51 ms [12:13:24] RECOVERY - Host ganeti6004 is UP: PING OK - Packet loss = 0%, RTA = 89.09 ms [12:13:24] RECOVERY - Host durum6001 is 
UP: PING OK - Packet loss = 0%, RTA = 89.36 ms [12:13:24] RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 89.32 ms [12:13:30] (03CR) 10Mszwarc: Update webonyx/graphql-php to 15.31.5 (033 comments) [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270902 (https://phabricator.wikimedia.org/T423216) (owner: 10STran) [12:14:43] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [12:14:48] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:14:52] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:15:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:15:52] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:16:07] (03PS3) 10Arnaudb: gerrit: update sync-instances cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) [12:16:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:16:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:16:22] !ack [12:16:22] 7836 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [12:16:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:17:07] XioNoX: all expected? cr1-drmrs to cr2-eqiad seems unrelated [12:17:08] best time for a drmrs transport link failure... [12:17:16] sigh [12:17:21] Ugh [12:17:37] let's repool esams, I haven't rebooted any router yet for upgrade, just prepared things [12:17:46] sounds good, thx [12:17:52] !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: network maintenance paused, T416450] [12:17:53] Ack. Ping if you need help [12:17:54] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool esams [reason: network maintenance paused, T416450] [12:17:55] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [12:18:14] (03CR) 10JMeybohm: [C:04-1] "As said on IRC: These probably need to go into a new file with:" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [12:18:24] we've lost just redundancy on drmrs right? why the page? [12:18:52] I see all back to 100% on the ha proxy graph, I guess just reconciliation time [12:18:56] PROBLEM - MariaDB Replica IO: ms2 on db1151 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2144.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2144.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:19:15] volans: yeah quick blip during the network convergence I guess [12:19:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11819173 (10MoritzMuehlenhoff) [12:19:50] looks like it's back up? [12:19:52] do you want us to do anything for drmrs or are you taking care of it? 
[12:20:19] (03CR) 10Mszwarc: [C:03+1] Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) (owner: 10STran) [12:20:21] I'm having a look at the transport link, but I'll let you look at the HAproxy alert [12:20:54] haproxy is back to normal AFAICS [12:21:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:21:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:21:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:21:44] yeah... just a blip [12:21:53] (03CR) 10JMeybohm: [C:03+2] kubernetes: Remove absent rsyslog config: block-docker-mount-spam [puppet] - 10https://gerrit.wikimedia.org/r/1270903 (https://phabricator.wikimedia.org/T395870) (owner: 10JMeybohm) [12:21:54] I guess I'll re-depool esams.... 
[12:21:58] with perfect timing [12:22:06] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270912 [12:22:12] !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool esams [reason: network maintenance continue, T416450] [12:22:14] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool esams [reason: network maintenance continue, T416450] [12:22:55] yeah, just annoying, we're < 8Gbps of transport traffic with esams+drmrs, so it's "just" annoying, no real user impact [12:22:56] RECOVERY - MariaDB Replica IO: ms2 on db1151 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:24:09] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [12:24:32] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270912 (owner: 10PipelineBot) [12:26:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T419635)', diff saved to https://phabricator.wikimedia.org/P90654 and previous config saved to /var/cache/conftool/dbconfig/20260414-122603-fceratto.json [12:26:06] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11819211 (10MoritzMuehlenhoff) [12:26:07] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:26:36] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270912 (owner: 10PipelineBot) [12:27:20] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:28:05] !log root@cumin1003 START - Cookbook sre.mysql.pool pool db1151: Security updates [12:28:05] !log root@cumin1003 
START - Cookbook sre.mysql.parsercache [12:28:18] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [12:28:18] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1151: Security updates [12:28:21] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:28:40] (03PS1) 10Dreamy Jazz: [WIP] Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) [12:28:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:32:33] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:33:35] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:33:52] !log cr1-esams - request chassis routing-engine master switch - T416450 [12:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:55] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [12:34:41] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:34:59] (03CR) 10JMeybohm: [C:03+1] role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [12:35:16] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:35:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270905 
(https://phabricator.wikimedia.org/T423216) (owner: 10STran) [12:35:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [12:36:07] (03Abandoned) 10STran: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270902 (https://phabricator.wikimedia.org/T423216) (owner: 10STran) [12:36:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P90656 and previous config saved to /var/cache/conftool/dbconfig/20260414-123611-fceratto.json [12:36:32] (03CR) 10JMeybohm: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [12:36:34] (03Abandoned) 10STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [12:37:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:37:54] (03CR) 10JMeybohm: [C:03+1] rest-gateway: Add liftwing listeners and network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [12:38:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:38:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:38:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:39:44] interfaces are slowly coming back up [12:41:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [12:41:48] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:41:52] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:42:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:42:59] cr1-esams is backup, keeping an eye on drmrs [12:43:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:43:14] is a backup or is back up? 
:-P [12:43:39] FIRING: [9x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:43:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:43:52] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:44:19] volans: back up on the backup re :) now pushing the upgrade on the primary RE, then will do another (and last) RE-switch [12:44:29] :D [12:44:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:46:16] (03CR) 10JMeybohm: [C:03+1] "Looks plausible" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [12:46:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P90657 and previous config saved to /var/cache/conftool/dbconfig/20260414-124620-fceratto.json [12:46:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:47:33] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11819359 (10Ladsgroup) Okay, after four tries (!) 
we got envoy to work. Now... [12:47:35] while I think about it, as esams is running routed ganeti, can we move all the VMs from one rack to the other before I do the switch reboot to reduce the impact of that switch reboot? [12:47:41] moritzm: ^ [12:47:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:48:10] RESOLVED: [6x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:48:39] RESOLVED: [12x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:48:54] (03PS1) 10Dreamy Jazz: [WIP] Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) [12:48:57] jouncebot: nowandnext [12:48:58] For the next 0 hour(s) and 11 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1200) [12:48:58] In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1300) [12:49:19] (03PS2) 10Dreamy Jazz: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) [12:49:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 
(https://phabricator.wikimedia.org/T423252) (owner: 10Dreamy Jazz) [12:49:42] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270447 (owner: 10PipelineBot) [12:49:47] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270600 (owner: 10PipelineBot) [12:49:52] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270892 (owner: 10PipelineBot) [12:50:20] XioNoX: which virt nodes need to be emptied? [12:50:56] (03PS3) 10Dreamy Jazz: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) [12:51:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:52:12] moritzm: 3006/3008 first (rack BW27) then 3005/3007 (rack By27) [12:52:19] or the other way around, doesn't matter :) [12:53:53] we can do that, but isn't esams going to be depooled anyway? [12:54:19] (03PS4) 10Dreamy Jazz: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) [12:54:48] moritzm: the site is already depooled, so, like we did for drmrs, we can jsut downtime the hosts/vms and reboot the switches. 
[12:55:01] but I'm wondering if it's not cleaner to migrate them [12:55:12] no strong opinion here [12:56:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T419635)', diff saved to https://phabricator.wikimedia.org/P90658 and previous config saved to /var/cache/conftool/dbconfig/20260414-125628-fceratto.json [12:56:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:56:37] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:56:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T419635)', diff saved to https://phabricator.wikimedia.org/P90659 and previous config saved to /var/cache/conftool/dbconfig/20260414-125642-fceratto.json [12:58:46] PROBLEM - Host re0.cr1-esams.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:59:17] (03PS2) 10Elukey: admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) [12:59:32] XioNoX: the current cookbook only drains one node at a time, so w/o manual intervention we can't move all VMs to e.g. 3006/3008, after the first is drained, draining th second would also select nodes on the first node again [12:59:39] (03CR) 10Elukey: "Updated!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:59:46] I see, ok [12:59:48] so I'd say let's downtime and reboot then [12:59:50] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade [12:59:58] sounds good [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1300) [13:00:05] Daimona, kipfel, Tran, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] \o [13:00:12] o/ [13:00:13] o/ [13:00:29] o/ I could deploy in 10 minutes or so if needed [13:00:38] (Depending on how long others take, I may need to do mine later as I need to go before the end of the window) [13:01:06] !log cr1-esams - request chassis routing-engine master switch - T416450 [13:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:09] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [13:01:43] (03PS1) 10Muehlenhoff: Depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1270924 [13:01:45] Daimona: You around (your first)? [13:01:48] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:52] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:59] (03PS1) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [13:02:28] (03CR) 10Volans: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1270924 (owner: 10Muehlenhoff) [13:02:43] (03PS1) 10Marostegui: db2214: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270926 [13:03:05] (03PS1) 10Arnaudb: mailman: test httpd config before reloading [puppet] - 10https://gerrit.wikimedia.org/r/1270921 (https://phabricator.wikimedia.org/T323208) [13:03:25] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:03:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2214.codfw.wmnet with reason: Reimage to Trixie [13:03:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2214: Reimage to Trixie [13:03:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:03:42] Looks like they are not here [13:03:48] RECOVERY - Host re0.cr1-esams.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.69 ms [13:03:50] (03CR) 10Marostegui: [C:03+2] db2214: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270926 (owner: 10Marostegui) [13:03:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:03:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2214: Reimage to Trixie [13:04:19] Dreamy_Jazz: yup, a bit late but around, sorry! 
[13:04:40] FIRING: [7x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:04:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:04:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:05:13] (03CR) 10Muehlenhoff: [C:03+2] Depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1270924 (owner: 10Muehlenhoff) [13:05:25] !log jmm@dns1004 START - running authdns-update [13:05:43] Daimona: You self deploying or need someone? 
[13:05:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:52] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:00] (I forget who has server access) [13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:25] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2214.codfw.wmnet with OS trixie [13:06:34] I'm not a deployer so I'd need help :) [13:06:37] (03PS1) 10Elukey: Update the systemd units to wait for udev before starting [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1270927 [13:06:47] !log jmm@dns1004 END - running authdns-update [13:06:48] cr1-esams is back up [13:07:02] yay [13:07:15] (03CR) 10Dreamy Jazz: [C:03+2] Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: 10Daimona Eaytoy) [13:07:42] (03CR) 10Dreamy Jazz: [C:03+2] Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: 10Stang) [13:08:12] (03Merged) 10jenkins-bot: Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: 10Daimona Eaytoy) [13:08:25] FIRING: [7x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:08:27] Tran: Do you mind if I do mine first before yours? 
[13:08:30] I'll deploy the others [13:08:33] Sure [13:08:34] (03Merged) 10jenkins-bot: Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: 10Stang) [13:08:39] FIRING: [12x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:08:40] (03CR) 10JMeybohm: [C:03+1] admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:08:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) (owner: 10Dreamy Jazz) [13:09:00] (03PS1) 10Btullis: Prepare dse-k8s-ctrl servers for ipip migration [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) [13:09:09] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [13:09:40] RESOLVED: [7x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:09:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:09:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:09:54] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-esams,cr2-esams IPv6,cr2-esams.mgmt with reason: router upgrade [13:09:58] (03Merged) 10jenkins-bot: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) (owner: 10Dreamy Jazz) [13:10:23] (03PS1) 10Marostegui: Revert "db2214: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270930 [13:10:26] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1270490|Stop setting $wgCampaignEventsEnableEventGoals (T414150)]], [[gerrit:1268293|Revert "zhwiki: Temporary Logo Change for WP25" (T414299)]], [[gerrit:1270918|Enable VisualEditor hCaptcha on testwiki (T423252)]] [13:10:33] T414150: Drop feature flag for event goals - https://phabricator.wikimedia.org/T414150 [13:10:33] T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299 [13:10:34] T423252: Enable VisualEditor hCaptcha on testwiki - https://phabricator.wikimedia.org/T423252 [13:10:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282 (10MoritzMuehlenhoff) 03NEW [13:10:51] (03CR) 10Elukey: [C:03+2] admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:11:23] (03PS1) 10Ladsgroup: envoy: Add 1 retry for swift services [puppet] - 10https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) [13:12:08] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS 
bullseye [13:12:08] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2068 [13:12:20] !log dreamyjazz@deploy1003 daimona, stang, dreamyjazz: Backport for [[gerrit:1270490|Stop setting $wgCampaignEventsEnableEventGoals (T414150)]], [[gerrit:1268293|Revert "zhwiki: Temporary Logo Change for WP25" (T414299)]], [[gerrit:1270918|Enable VisualEditor hCaptcha on testwiki (T423252)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:12:22] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:12:36] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:12:39] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11819562 (10Volans) Some execption example from the puppetserver logs (cut out as they are pretty long): ` 2026-04-14T12:57:10.834Z ERROR [qtp1196799668-113293]... [13:12:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye [13:12:43] * Lucas_WMDE is here now [13:12:49] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11819563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye [13:12:59] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [13:13:06] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [13:13:11] Dreamy_Jazz, the logo reverted as expected, LGTM [13:13:11] (03PS14) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. 
[alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) [13:13:23] (03CR) 10Blake: "Acknowledged" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [13:13:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:27] (03PS1) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1270932 (https://phabricator.wikimedia.org/T355446) [13:13:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T419635)', diff saved to https://phabricator.wikimedia.org/P90661 and previous config saved to /var/cache/conftool/dbconfig/20260414-131326-fceratto.json [13:13:30] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:13:32] (03PS1) 10Robertsky: Update wikimaniawiki namespace search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) [13:13:39] RESOLVED: [10x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:14:06] (03Abandoned) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1270932 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:14:10] !log disable cert-renewal on wikikube staging clusters as a test for the PKI discovery intermediate rollout - To rollback, revert: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1270873 - T420993 [13:14:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:13] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993 [13:14:37] Diamona: Any testing to be done? [13:14:46] daimona: [13:14:52] Daimona: [13:15:06] I'll take a quick look [13:15:09] (03CR) 10Marostegui: [C:03+1] mariadb: wiki-replicas: add missing grants [puppet] - 10https://gerrit.wikimedia.org/r/1270891 (https://phabricator.wikimedia.org/T422806) (owner: 10FNegri) [13:15:14] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Tue 12 May 2026 12:40:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [13:15:16] My testing is done [13:15:34] !log cr2-esams - request vmhost reboot - T416450 [13:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:37] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [13:15:52] Looks good, thank you [13:15:57] Proceeding [13:16:03] !log dreamyjazz@deploy1003 daimona, stang, dreamyjazz: Continuing with sync [13:16:12] (03PS20) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [13:17:26] (03CR) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. 
(032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [13:19:51] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:19:51] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:19:53] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270490|Stop setting $wgCampaignEventsEnableEventGoals (T414150)]], [[gerrit:1268293|Revert "zhwiki: Temporary Logo Change for WP25" (T414299)]], [[gerrit:1270918|Enable VisualEditor hCaptcha on testwiki (T423252)]] (duration: 09m 27s) [13:19:55] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:20:00] T414150: Drop feature flag for event goals - https://phabricator.wikimedia.org/T414150 [13:20:00] T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299 [13:20:00] T423252: Enable VisualEditor hCaptcha on testwiki - https://phabricator.wikimedia.org/T423252 [13:20:05] Tran: Over to you [13:20:11] Thanks [13:21:49] I can self deploy mine, but if anyone has any opinions on https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1270905 I would appreciate a third pair of eyes. It's a version bump/fix that is blocking another backport I want to do tomorrow so I'm not very familiar with it but it looked safe. 
[13:21:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/50 (Core: cr2-esams:et-0/0/0 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:23:25] FIRING: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:23:28] !log on testcommonswiki drop table if exists categorylinks; drop table if exists externallinks; drop table if exists linktarget; drop table if exists collation; drop table if exists imagelinks; drop table if exists iwlinks; drop table if exists existencelinks; (T421914) [13:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P90662 and previous config saved to /var/cache/conftool/dbconfig/20260414-132334-fceratto.json [13:23:37] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [13:23:54] FIRING: [16x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:24:16] * Lucas_WMDE is amazed to see testcommonswiki rise from the dead [13:24:29] Tran: I had a look at the change but I don’t understand the library well enough to really review it… [13:25:18] (03CR) 10Bking: [C:03+2] opensearch: hack around upstream 2.x+ packages [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [13:26:33] :D [13:26:37] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2214.codfw.wmnet with reason: host 
reimage [13:26:41] but I think I would say go ahead? [13:26:50] Lucas_WMDE: From what I can tell, it shouldn't affect any of our current usages and didn't force any follow ups. This will go out in the next branch cut anyway so I think it'll be ok? [13:26:53] k going [13:27:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) (owner: 10STran) [13:27:17] (03CR) 10Chlod Alejandro: [C:03+1] Update wikimaniawiki namespace search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: 10Robertsky) [13:27:52] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:27:52] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:28:09] cr2-esams is back up [13:28:30] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-bw27-esams.mgmt.esams.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:29:09] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:30:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:30:45] (03PS2) 10Robertsky: Update wikimaniawiki namespace search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) [13:31:03] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2214.codfw.wmnet with reason: host reimage [13:31:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [13:31:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:31:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/50 (Core: cr2-esams:et-0/0/0 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:33:35] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: router upgrade [13:33:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P90663 and previous config saved to /var/cache/conftool/dbconfig/20260414-133342-fceratto.json [13:33:46] (03PS2) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [13:33:54] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:34:21] I'm going to upgrade asw1-bw27, so that's going to bring host offline, hosts are depooled [13:35:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [13:35:37] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on asw1-bw27-esams,asw1-bw27-esams IPv6,asw1-bw27-esams.mgmt with reason: router upgrade [13:36:07] !log asw1-bw27-esams> request system reboot - T416450 [13:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:11] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [13:36:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:36:49] (03PS3) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [13:37:50] volans, claime, any idea what's up with this puppet failure in drmrs? [13:37:59] (03Merged) 10jenkins-bot: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) (owner: 10STran) [13:38:21] XioNoX: if they were related to puppetserver1002 it's recovering as it's been depooled [13:38:24] see -sre [13:38:25] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1270905|Update webonyx/graphql-php to 15.31.5 (T423216)]] [13:38:28] T423216: Update webonyx/graphql-php to 15.31.5 - https://phabricator.wikimedia.org/T423216 [13:38:43] (03CR) 10Clément Goubert: "@mvernon@wikimedia.org was a little worried about retries when we are overwhelmed, I think 1 retry is ok" [puppet] - 10https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [13:39:14] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:16] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:18] PROBLEM - Host durum3005 is DOWN: PING CRITICAL - Packet loss = 100% 
[13:39:18] PROBLEM - Host tcp-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:18] PROBLEM - Host tcp-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:18] PROBLEM - Host durum3006 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:40] I imagine that's the switch reboot? [13:39:57] FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:00] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 185.15.59.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:40:04] !ack [13:40:05] 7837 (ACKED) [7x] ProbeDown sre (probes/service esams) [13:40:13] !log stran@deploy1003 stran: Backport for [[gerrit:1270905|Update webonyx/graphql-php to 15.31.5 (T423216)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:40:23] was not depooled? [13:40:27] FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:38] *downtimed [13:40:42] testing now [13:40:52] volans: Do we have "trickling" downtimes ? [13:41:08] (i.e. silencing the switch silences host down for hosts plugged into it?) 
[13:41:21] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:41:23] claime: lol [13:41:30] FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs3009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [13:41:30] claime: in icinga yes, but not in alert-manager [13:41:35] ^^^ [13:41:35] cdanis: I figured [13:41:43] nothing looks broken, continuing backport [13:41:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:41:47] !log stran@deploy1003 stran: Continuing with sync [13:41:48] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11819656 (10VRiley-WMF) a:05VRiley-WMF→03Jgreen [13:41:52] when the old system is better than the new one :D [13:41:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:42:00] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:42:00] claime: I did downtime the hosts with `P{P:netbox::host%location ~ "BW.*esams"}` but of course it didn't downtime the VMs... 
[13:42:19] XioNoX: ah :D [13:42:24] nor the global services not attached to a host [13:42:32] yeah... [13:42:48] the vms are easy to add ;) [13:43:02] volans: are they ? :) we need to query on which hosts they are [13:43:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T419635)', diff saved to https://phabricator.wikimedia.org/P90664 and previous config saved to /var/cache/conftool/dbconfig/20260414-134350-fceratto.json [13:43:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:43:55] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:44:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [13:44:11] we have cluster and group [13:44:16] FIRING: [20x] JobUnavailable: Reduced availability for job benthos in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:44:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T419635)', diff saved to https://phabricator.wikimedia.org/P90665 and previous config saved to /var/cache/conftool/dbconfig/20260414-134416-fceratto.json [13:44:57] FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:30] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270905|Update webonyx/graphql-php to 15.31.5 (T423216)]] (duration: 07m 05s) [13:45:34] T423216: 
Update webonyx/graphql-php to 15.31.5 - https://phabricator.wikimedia.org/T423216 [13:45:50] (03CR) 10STran: "recheck" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [13:45:55] FIRING: [5x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:46:09] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:46:11] done and I think that's it for backports this window [13:46:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs3009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [13:46:50] not sure how I can downtime that paging alert ahead of time? [13:47:04] that's the tricky one [13:47:14] you can downtime it but will masquerade other issues [13:47:45] Tran: thanks! (and thanks Dreamy_Jazz too!) 
[13:47:50] !log UTC afternoon backport+config window done [13:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:41] switch is back up on console, waiting for the ports to show up [13:49:02] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 185.15.59.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:49:16] FIRING: [20x] JobUnavailable: Reduced availability for job benthos in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:49:22] here are the recoveries [13:49:54] RECOVERY - Host durum3005 is UP: PING OK - Packet loss = 0%, RTA = 80.61 ms [13:49:56] RECOVERY - Host tcp-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.77 ms [13:49:56] RECOVERY - Host durum3006 is UP: PING OK - Packet loss = 0%, RTA = 80.59 ms [13:49:57] RECOVERY - Host tcp-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.62 ms [13:49:57] RESOLVED: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:58] RECOVERY - Host ncredir3005 is UP: PING OK - Packet loss = 0%, RTA = 80.70 ms [13:50:16] RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.71 ms [13:50:27] FIRING: [9x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:28] volans: and how to downtime the VMs in a given rack? 
[13:50:29] XioNoX: with routed ganeti is indeed harder, the cluster is just one for all vms [13:50:41] yeah [13:50:47] ganeti_cluster: esams03, ganeti_group: B (from hiera) [13:51:21] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:51:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs3009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [13:52:00] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:52:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935536 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:52:48] volans: and the only reason it's paging is because the depool threshold is more than a rack, so it sends the LVS healthchecks to the offline realservers... 
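Editorial note on the depool-threshold remark above: the sketch below illustrates why losing a whole rack can leave unhealthy realservers pooled. It assumes PyBal-style semantics, where the threshold is the fraction of a pool that must stay pooled regardless of health. The function name, pool size, and threshold value are hypothetical and are not taken from the actual esams configuration.

```python
import math


def pooled_after_failures(total: int, down: int, depool_threshold: float) -> tuple[int, int]:
    """Return (healthy_pooled, unhealthy_still_pooled) for one pool.

    Assumption: depool_threshold is the fraction of realservers that the
    load balancer keeps pooled no matter what the health checks say.
    """
    min_pooled = math.ceil(total * depool_threshold)  # servers that must stay pooled
    healthy = total - down
    # If the healthy servers alone cannot satisfy the minimum, some
    # unhealthy servers stay pooled and keep receiving traffic.
    unhealthy_still_pooled = min(down, max(0, min_pooled - healthy))
    return healthy, unhealthy_still_pooled


# Hypothetical pool of 8 realservers split over two racks, threshold 0.75:
# one rack (4 servers) going down leaves 2 unhealthy servers pooled.
print(pooled_after_failures(8, 4, 0.75))
# With a smaller failure, everything unhealthy can be depooled.
print(pooled_after_failures(8, 2, 0.5))
```

Under these assumed numbers, any probe routed to one of the unhealthy-but-pooled servers fails, which is consistent with the paging ProbeDown alert seen during the switch reboot.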
[13:53:07] ah [13:53:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2214.codfw.wmnet with OS trixie [13:54:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2214: after reimage to trixie [13:54:16] RESOLVED: [20x] JobUnavailable: Reduced availability for job benthos in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:54:25] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: router upgrade [13:54:36] I'm going to reboot rack BW27 shortly [13:54:57] XioNoX: cumin 'F:lldp.parent ~ "ganeti300[6,8]\..*"' [13:55:04] for the downtime of the vms of BW [13:55:08] !log mszwarc@deploy1003 mwscript-k8s job started: foreachwikiindblist all backfillInterwikiRightsLog.php --remote-wiki metawiki 20260311190000 # T6055 (third attempt) [13:55:11] T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055 [13:55:13] you could do the same for the other one with the other ganeti [13:55:22] volans: nice! 
thanks [13:55:27] RESOLVED: [9x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:55:43] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-by27-esams,asw1-by27-esams IPv6,asw1-by27-esams.mgmt with reason: router upgrade [13:56:17] (03PS4) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [13:56:30] volans: Scheduling downtime on Icinga server alert1002.wikimedia.org for hosts: bast3007,doh3005,hcaptcha-proxy[3001-3002],install3004,ncredir3006,netflow3004,prometheus3004 [13:56:31] nice [13:56:41] :D [13:56:45] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: router upgrade [13:56:52] I need to document it somewhere :) [13:57:26] !log asw1-by27-esams> request system reboot - T416450 [13:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:29] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [13:57:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935536 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:57:45] so we should expect a page too right? 
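Editorial note on the `F:lldp.parent ~ "ganeti300[6,8]\..*"` query quoted above: the sketch below shows what that regular expression selects, using plain Python `re` rather than cumin itself. The inventory mapping and VM names are hypothetical, and unanchored matching is an assumption about how the `~` operator behaves.

```python
import re

# Regex copied from the query in the log. Note that the character class
# [6,8] matches "6", "8", or a literal comma, so for real hostnames it
# behaves like [68].
PATTERN = re.compile(r"ganeti300[6,8]\..*")

# Hypothetical inventory: VM name -> LLDP parent (hypervisor FQDN).
lldp_parent = {
    "vm-a": "ganeti3006.esams.wmnet",
    "vm-b": "ganeti3007.esams.wmnet",
    "vm-c": "ganeti3008.esams.wmnet",
}


def vms_on_matching_parents(inventory: dict[str, str]) -> list[str]:
    """Return the VMs whose LLDP parent matches the rack's hypervisors."""
    return sorted(vm for vm, parent in inventory.items() if PATTERN.search(parent))


print(vms_on_matching_parents(lldp_parent))
```

Selecting VMs by the hypervisor they are attached to sidesteps the problem raised earlier in the log: the Ganeti cluster and group are the same for every VM at the site, so they cannot identify a single rack.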
[13:58:06] volans: yeah, unless we're lucky and the alertmanager probes only arrive on a working backend :) [13:58:10] :D [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1400) [14:01:03] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 185.15.59.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:02:00] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:02:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T419635)', diff saved to https://phabricator.wikimedia.org/P90667 and previous config saved to /var/cache/conftool/dbconfig/20260414-140211-fceratto.json [14:02:16] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:04:22] (03PS1) 10Arnaudb: gerrit: access logging with Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) [14:04:24] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [14:07:43] (03PS1) 10Muehlenhoff: Make cn=growthbook-admin managed in Bitu [puppet] - 10https://gerrit.wikimedia.org/r/1270952 (https://phabricator.wikimedia.org/T420688) [14:10:30] (03CR) 10Marostegui: [C:03+2] Revert "db2214: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270930 (owner: 10Marostegui) [14:11:03] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 185.15.59.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:11:10] volans: no page :) [14:11:28] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:11:28] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:11:33] too soon? [14:11:35] !ack [14:11:36] 7838 (ACKED) [4x] ProbeDown sre (ip4 probes/service esams) [14:11:38] jinxed it :) [14:11:48] weird, it started paging when the switch came back up :) [14:11:49] hahaha [14:11:52] yeah [14:12:03] XioNoX: I also noticed some imbalance of VMs distribution between the racks, mentioned that to suk.he [14:12:15] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:12:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P90669 and previous config saved to /var/cache/conftool/dbconfig/20260414-141219-fceratto.json [14:12:45] volans: can you link it to https://phabricator.wikimedia.org/T395883 ? 
[14:13:24] sure [14:13:38] !log ayounsi@cumin1003 START - Cookbook sre.hosts.remove-downtime for 8 hosts [14:13:43] removing the downtimes [14:13:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 8 hosts [14:13:48] but we're all good network wise [14:14:03] !log ayounsi@cumin1003 START - Cookbook sre.hosts.remove-downtime for 12 hosts [14:14:10] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 12 hosts [14:14:17] !log ayounsi@cumin1003 START - Cookbook sre.hosts.remove-downtime for 13 hosts [14:14:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 13 hosts [14:15:25] (03PS1) 10Cwhite: opensearch: correct o11y usage in comment [puppet] - 10https://gerrit.wikimedia.org/r/1270953 (https://phabricator.wikimedia.org/T422860) [14:16:02] repooling esams [14:16:02] XioNoX: commented there [14:16:21] !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: network maintenance finished, T416450] [14:16:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool esams [reason: network maintenance finished, T416450] [14:16:25] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [14:16:28] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:16:28] RESOLVED: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:39] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS bullseye [14:16:44] 06SRE, 
10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11819787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye execu... [14:16:58] !log ayounsi@cumin1003 conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=esams [reason: esams maintenance over] [14:17:14] !log sukhe@dns1004 START - running authdns-update [14:18:20] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286 (10MatthewVernon) 03NEW [14:18:32] !log sukhe@dns1004 END - running authdns-update [14:20:01] jouncebot: nowandnext [14:20:01] For the next 0 hour(s) and 9 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1400) [14:20:01] In 0 hour(s) and 9 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1430) [14:20:45] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1270952 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [14:22:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P90670 and previous config saved to /var/cache/conftool/dbconfig/20260414-142227-fceratto.json [14:22:32] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11819840 (10ayounsi) 05Open→03Resolved a:03ayounsi Upgraded to 23.4R2-S8 and all is well. 
[14:22:54] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet [14:23:55] (03Abandoned) 10Ayounsi: Temporarily geodns GB and IE to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1268538 (https://phabricator.wikimedia.org/T416450) (owner: 10Ayounsi) [14:24:46] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2068.codfw.wmnet [14:25:19] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet [14:25:33] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2068.codfw.wmnet [14:25:41] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet [14:26:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet [14:26:11] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet [14:28:43] (03CR) 10Muehlenhoff: [C:03+2] Make cn=growthbook-admin managed in Bitu [puppet] - 10https://gerrit.wikimedia.org/r/1270952 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [14:29:44] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11819873 (10MatthewVernon) At the suggestion of @elukey on IRC, I am trying a firmware downgrade to 6.10.30.20 (the version 5.0.20.0 that this system started wit... 
[14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1430) [14:30:19] (03PS1) 10Andrew Bogott: Remove disk-based ssh keys for me, Andrew Bogott [puppet] - 10https://gerrit.wikimedia.org/r/1270956 [14:30:28] (03CR) 10Clément Goubert: "That's strange, as far as I can tell, ATS does path normalization for `%3A` in [0], and it is in the plugin chain for `/service/` [1] so w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:30:53] (03PS1) 10CDanis: SwiftFileBackend: propagate tracing context to HTTP client [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270957 [14:32:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T419635)', diff saved to https://phabricator.wikimedia.org/P90672 and previous config saved to /var/cache/conftool/dbconfig/20260414-143235-fceratto.json [14:32:40] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:32:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [14:33:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T419635)', diff saved to https://phabricator.wikimedia.org/P90673 and previous config saved to /var/cache/conftool/dbconfig/20260414-143301-fceratto.json [14:34:10] anyone from Test Kitchen using this deploy window? 
[14:34:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1270956 (owner: 10Andrew Bogott) [14:34:36] (03CR) 10Andrew Bogott: [C:03+2] Remove disk-based ssh keys for me, Andrew Bogott [puppet] - 10https://gerrit.wikimedia.org/r/1270956 (owner: 10Andrew Bogott) [14:35:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet [14:35:07] (03PS1) 10Marostegui: data.yaml: Remove my old key [puppet] - 10https://gerrit.wikimedia.org/r/1270958 [14:35:58] (03PS1) 10Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - 10https://gerrit.wikimedia.org/r/1270959 (https://phabricator.wikimedia.org/T422509) [14:36:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye [14:36:14] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11819920 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye [14:36:32] (03CR) 10CI reject: [V:04-1] cloud-vps vendordata: update apt preferences [puppet] - 10https://gerrit.wikimedia.org/r/1270959 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott) [14:37:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:38:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1270958 (owner: 10Marostegui) [14:39:12] (03CR) 10Marostegui: [C:03+2] data.yaml: Remove my old key [puppet] - 10https://gerrit.wikimedia.org/r/1270958 (owner: 10Marostegui) [14:39:31] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.mysql.pool (exit_code=0) pool db2214: after reimage to trixie [14:40:11] (03PS1) 10Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - 10https://gerrit.wikimedia.org/r/1270960 (https://phabricator.wikimedia.org/T422509) [14:40:46] (03CR) 10CI reject: [V:04-1] cloud-vps vendordata: update apt preferences [puppet] - 10https://gerrit.wikimedia.org/r/1270960 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott) [14:43:15] (03Abandoned) 10Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - 10https://gerrit.wikimedia.org/r/1270959 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott) [14:43:21] (03Abandoned) 10Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - 10https://gerrit.wikimedia.org/r/1270960 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott) [14:44:38] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:44:46] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11820039 (10Scott_French) [14:48:31] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11820087 (10Scott_French) @MLechvien-WMF - I've updated the task description to capture the discussion here... 
[14:49:39] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:49:55] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:49:56] (03PS1) 10Papaul: Add my FIDO backup key [puppet] - 10https://gerrit.wikimedia.org/r/1270962 (https://phabricator.wikimedia.org/T423293) [14:49:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:50:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:50:46] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:51:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T419635)', diff saved to https://phabricator.wikimedia.org/P90676 and previous config saved to /var/cache/conftool/dbconfig/20260414-145107-fceratto.json [14:51:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:51:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:51:45] (03PS3) 10Blake: service: exclude apus from the switchover. 
[puppet] - 10https://gerrit.wikimedia.org/r/1269382 (https://phabricator.wikimedia.org/T422166) [14:51:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:53:28] (03CR) 10Elukey: istio: revisit Prometheus buckets for Wikikube (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [14:53:53] (03PS1) 10Ayounsi: move-vlan cookbook: add "inplace" support [cookbooks] - 10https://gerrit.wikimedia.org/r/1270965 [14:54:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [14:54:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:54:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:04] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:56:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:56:39] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:56:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:59:02] 
(03CR) 10Scott French: [C:03+1] "Thanks, Blake!" [puppet] - 10https://gerrit.wikimedia.org/r/1269382 (https://phabricator.wikimedia.org/T422166) (owner: 10Blake) [14:59:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [14:59:36] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11820175 (10WMDE-leszek) 05Resolved→03Open hello @MatthewVernon, hello all. I don't know if you'll able to troubleshoot it somehow but it seems that `nicho... [14:59:40] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:00:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1500). Please do the needful. 
[15:01:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P90677 and previous config saved to /var/cache/conftool/dbconfig/20260414-150115-fceratto.json [15:01:32] (03PS1) 10Federico Ceratto: sre.mysql.depool: Do not require tmux/screen [cookbooks] - 10https://gerrit.wikimedia.org/r/1270963 [15:02:20] (03PS1) 10Daniel Kinzler: rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270968 [15:03:57] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270968 (owner: 10Daniel Kinzler) [15:04:13] (03CR) 10Clément Goubert: [C:03+1] redioscope: capture rate limit window duration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270901 (owner: 10Daniel Kinzler) [15:05:03] (03CR) 10Jasmine: "Thanks, done!" [dns] - 10https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [15:05:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:06:00] jouncebot: nowandnext [15:06:00] For the next 0 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1500) [15:06:00] In 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1600) [15:06:07] (03PS4) 10Blake: service: exclude apus from the switchover. 
[puppet] - 10https://gerrit.wikimedia.org/r/1269382 [15:06:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:07:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:07:21] (03PS5) 10Blake: service: exclude apus from the switchover. [puppet] - 10https://gerrit.wikimedia.org/r/1269382 [15:07:34] (03CR) 10Blake: service: exclude apus from the switchover. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1269382 (owner: 10Blake) [15:08:49] (03CR) 10Clément Goubert: [C:03+1] service: exclude apus from the switchover. [puppet] - 10https://gerrit.wikimedia.org/r/1269382 (owner: 10Blake) [15:09:06] (03CR) 10Daniel Kinzler: [C:03+2] redioscope: capture rate limit window duration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270901 (owner: 10Daniel Kinzler) [15:09:20] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270968 (owner: 10Daniel Kinzler) [15:10:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:10:04] (03PS1) 10Vgutierrez: admin: Remove legacy ssh key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/1270969 [15:10:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1270969 (owner: 10Vgutierrez) [15:10:56] (03PS1) 10CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - 10https://gerrit.wikimedia.org/r/1270970 [15:11:07] (03Merged) 10jenkins-bot: redioscope: capture rate limit window duration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270901 (owner: 10Daniel Kinzler) [15:11:09] 
(CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270970 (owner: CDanis)
[15:11:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P90678 and previous config saved to /var/cache/conftool/dbconfig/20260414-151123-fceratto.json
[15:11:30] (Merged) jenkins-bot: rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - https://gerrit.wikimedia.org/r/1270968 (owner: Daniel Kinzler)
[15:12:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:12:19] (CR) Scott French: [C:+1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1269382 (owner: Blake)
[15:12:34] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:12:43] (CR) Vgutierrez: [C:+2] admin: Remove legacy ssh key for vgutierrez [puppet] - https://gerrit.wikimedia.org/r/1270969 (owner: Vgutierrez)
[15:12:59] (CR) CI reject: [V:-1] puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270970 (owner: CDanis)
[15:13:08] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:13:16] (PS1) CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:13:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet,
wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:15:21] (CR) CI reject: [V:-1] puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:15:50] !log daniel@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply
[15:16:00] SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11820271 (MatthewVernon) As before, post-installer boot was fine, but after puppet it gets as far as: ` Booting from Hard drive C: GRUB ` and hangs.
[15:16:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:16:40] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:17:26] !log daniel@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[15:18:10] (Abandoned) CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270970 (owner: CDanis)
[15:18:12] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[15:18:41] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[15:20:04] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:20:06] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:20:55] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:20:58]
(PS1) Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270975 (https://phabricator.wikimedia.org/T422509)
[15:21:01] (PS2) CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:21:06] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:21:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T419635)', diff saved to https://phabricator.wikimedia.org/P90679 and previous config saved to /var/cache/conftool/dbconfig/20260414-152132-fceratto.json
[15:21:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:21:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[15:21:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T419635)', diff saved to https://phabricator.wikimedia.org/P90680 and previous config saved to /var/cache/conftool/dbconfig/20260414-152156-fceratto.json
[15:21:58] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301 (Passimacopoulos) NEW
[15:22:20] (CR) Jasmine: [C:+2] Add Kubernetes POD IP reverse range delegations for wikikube-ctrl200[4-5] [dns] - https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) (owner: Jasmine)
[15:22:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:23:02] !log jasmine@dns1004 START - running authdns-update
[15:23:25] SRE, SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11820314
(MatthewVernon) @WMDE-leszek analytics_privatedata_users isn't an LDAP group, it's a shell group, so it wouldn't appear in the ldap listing (for ins...
[15:23:54] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:23:58] (PS1) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[15:24:26] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[15:24:29] (PS1) CDanis: hiddenparma: add atsuko [labs/private] - https://gerrit.wikimedia.org/r/1270977
[15:24:29] !log jasmine@dns1004 END - running authdns-update
[15:24:46] (CR) CDanis: [C:+2] hiddenparma: add atsuko [labs/private] - https://gerrit.wikimedia.org/r/1270977 (owner: CDanis)
[15:25:05] (CR) CDanis: [V:+2 C:+2] hiddenparma: add atsuko [labs/private] - https://gerrit.wikimedia.org/r/1270977 (owner: CDanis)
[15:25:24] (PS2) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[15:25:49] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:26:38] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[15:27:04] (CR) CI reject: [V:-1] redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[15:29:08] (PS3) CDanis: puppetserver: install cidergrinder, run daily grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:29:16] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:30:22] (CR) Klausman: [C:+1] Update the systemd units to wait for udev before starting
[debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1270927 (owner: Elukey)
[15:30:32] (PS1) Volans: sre.ganeti: add new pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:30:59] jouncebot: nowandnext
[15:30:59] For the next 0 hour(s) and 29 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1500)
[15:30:59] In 0 hour(s) and 29 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1600)
[15:31:46] SRE, SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11820350 (WMDE-leszek) Open→Resolved ah, thanks @MatthewVernon, I forgot about this detail. Shell groups seem ok then, so I'll close this ticket...
[15:32:19] (CR) CDanis: [C:+2] SwiftFileBackend: propagate tracing context to HTTP client [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270957 (owner: CDanis)
[15:32:49] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS bullseye
[15:32:55] SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11820361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye execu...
[15:33:00] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[15:33:07] (PS2) Volans: sre.ganeti: add new pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:33:28] (PS3) Volans: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:34:09] (CR) TrainBranchBot: [C:+2] "Approved by cdanis@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270957 (owner: CDanis)
[15:35:08] (CR) Klausman: [C:+1] istio: revisit Prometheus buckets for Wikikube (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[15:36:10] mvernon@cumin2002 upgrade-firmware (PID 3326764) is awaiting input
[15:36:45] (Merged) jenkins-bot: SwiftFileBackend: propagate tracing context to HTTP client [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270957 (owner: CDanis)
[15:37:11] !log cdanis@deploy1003 Started scap sync-world: Backport for [[gerrit:1270957|SwiftFileBackend: propagate tracing context to HTTP client]]
[15:38:59] !log cdanis@deploy1003 cdanis: Backport for [[gerrit:1270957|SwiftFileBackend: propagate tracing context to HTTP client]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:39:17] hey Amir1 you around?
👀
[15:39:44] cdanis: meeting
[15:39:50] (CR) Clément Goubert: "`" [deployment-charts] - https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[15:39:51] (PS1) Scott French: Set initialDelaySeconds on aqs-http-gateway direct Cassandra clients [deployment-charts] - https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168)
[15:40:40] (CR) Scott French: "Just preparing this in case we decide to go this route." [deployment-charts] - https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: Scott French)
[15:40:46] lol I don't have access to upload to testwiki?
[15:41:41] !log cdanis@deploy1003 cdanis: Continuing with sync
[15:43:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T419635)', diff saved to https://phabricator.wikimedia.org/P90681 and previous config saved to /var/cache/conftool/dbconfig/20260414-154302-fceratto.json
[15:43:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:43:20] SRE, SRE-swift-storage, Ceph, ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11820398 (Blake) I'll merge the exclusion patch and work on updating the docs tomorrow. I'm inclined to...
[15:45:35] !log cdanis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270957|SwiftFileBackend: propagate tracing context to HTTP client]] (duration: 08m 24s)
[15:46:48] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11820425 (BCornwall) We've decided to move forward with this task. Would dcops be willing to handle the NIC revert in lvs1017?
[15:47:35] (PS3) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[15:50:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet
[15:52:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[15:52:17] SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11820469 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye
[15:53:10] (PS1) Robertsky: Update wikimania wordmark for 2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331)
[15:53:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P90682 and previous config saved to /var/cache/conftool/dbconfig/20260414-155310-fceratto.json
[15:54:07] SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11820483 (MatthewVernon) Tried a BIOS upgrade from 2.12.2 to 2.24.0. That didn't cause the system to become bootable, but trying yet another reimage. Plan is...
[15:54:11] (PS2) Ayounsi: move-vlan cookbook: add "inplace" support [cookbooks] - https://gerrit.wikimedia.org/r/1270965
[15:56:07] (PS1) Jforrester: wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks [mediawiki-config] - https://gerrit.wikimedia.org/r/1270987
[15:56:07] (PS1) Jforrester: wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1270988
[15:56:36] (PS5) Pppery: Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054)
[15:56:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: Robertsky)
[15:57:03] (PS4) Volans: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:57:31] (CR) Pppery: Drop 1.5x logos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: Pppery)
[15:57:42] I'm going to deploy ^ to debug the Abstract Wiki caching issue unless someone shouts.
[15:59:39] (CR) CI reject: [V:-1] sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[15:59:48] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270987 (owner: Jforrester)
[16:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1600).
[16:00:05] No Gerrit patches in the queue for this window AFAICS.
[16:01:22] (Merged) jenkins-bot: wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks [mediawiki-config] - https://gerrit.wikimedia.org/r/1270987 (owner: Jforrester)
[16:01:47] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1270987|wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks]]
[16:01:52] (PS5) Volans: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[16:03:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P90683 and previous config saved to /var/cache/conftool/dbconfig/20260414-160319-fceratto.json
[16:03:40] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1270987|wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:03:43] SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307 (herron) NEW p:Triage→Medium
[16:04:34] !log jforrester@deploy1003 jforrester: Continuing with sync
[16:04:46] (CR) Ayounsi: [C:+1] "nice!"
[cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[16:06:01] (CR) Clément Goubert: redioscope: move survey to service defintion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:07:23] (CR) Volans: [C:+2] sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[16:08:12] (PS4) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[16:08:12] (PS1) Sbisson: Register ArticleGuidance extension and enable in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295)
[16:08:20] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270987|wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks]] (duration: 06m 32s)
[16:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:25] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270988 (owner: Jforrester)
[16:10:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[16:11:02] SRE-SLO, Patch-For-Review: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11820559 (herron)
[16:11:27] (Merged) jenkins-bot: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[16:11:43] (Merged)
jenkins-bot: wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1270988 (owner: Jforrester)
[16:12:04] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1270988|wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info]]
[16:12:11] (CR) Clément Goubert: [C:+1] redioscope: move survey to service defintion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:13:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T419635)', diff saved to https://phabricator.wikimedia.org/P90684 and previous config saved to /var/cache/conftool/dbconfig/20260414-161326-fceratto.json
[16:13:31] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:13:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[16:13:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90685 and previous config saved to /var/cache/conftool/dbconfig/20260414-161351-fceratto.json
[16:13:52] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1270988|wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:15:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[16:15:39] (CR) Tjones: [C:+1] "I only have +1 in this repo, but this looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[16:16:20] (CR) Cwhite: [C:+2] opensearch: correct o11y usage in comment [puppet] - https://gerrit.wikimedia.org/r/1270953 (https://phabricator.wikimedia.org/T422860) (owner: Cwhite)
[16:16:45] !log jforrester@deploy1003 jforrester: Continuing with sync
[16:16:49] (CR) Cwhite: [C:+2] smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: Cwhite)
[16:17:43] (PS1) Herron: puppet: remove pyrra modules/profiles [puppet] - https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307)
[16:17:46] (PS1) Herron: pyrra: remove pyrra/slo/slos dns entries [dns] - https://gerrit.wikimedia.org/r/1270995 (https://phabricator.wikimedia.org/T423307)
[16:17:48] (PS2) Herron: pyrra: remove configuration for web interface [puppet] - https://gerrit.wikimedia.org/r/1270992 (https://phabricator.wikimedia.org/T423307)
[16:17:50] (PS4) Herron: pyrra: ensure absent on package and services [puppet] - https://gerrit.wikimedia.org/r/1270974 (https://phabricator.wikimedia.org/T423307)
[16:18:38] (CR) Daniel Kinzler: redioscope: move survey to service defintion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:19:08] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:19:16] (CR) Daniel Kinzler: [C:+2] redioscope: move survey to service defintion [deployment-charts] -
https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:20:15] (CR) Tjones: [C:+1] "+1 is all I got! Seems reasonable" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[16:20:32] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270988|wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info]] (duration: 08m 27s)
[16:20:38] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301#11820651 (Aklapper) Hi @Passimacopoulos, welcome to Wikimedia Phabricator! Please also connect your [MediaWiki/SUL account](https://meta.wikimedia.org/wiki/Specia...
[16:21:06] (CR) Tjones: [C:+1] search: add alt. completion indices to test keyword tokenizer (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[16:21:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:21:20] ops-codfw, SRE, DC-Ops, Wikidata Platform Team: Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312 (RobH) NEW
[16:21:30] (Merged) jenkins-bot: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:21:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:21:49] ops-codfw, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11820687 (RobH)
[16:21:59] ops-codfw,
SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11820691 (RobH)
[16:22:37] ops-codfw, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11820694 (RobH) a:bking Please update the site.pp file with the insetup role for your team (detailed on https://wiki...
[16:23:35] (CR) Scott French: "@eevans@wikimedia.org - Depending on whether there's a straightforward solution to improve gocql client behavior, it probably makes sense " [deployment-charts] - https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: Scott French)
[16:23:58] ops-eqiad, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314 (RobH) NEW
[16:24:18] ops-eqiad, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11820718 (RobH) a:bking Please update the site.pp file with the insetup role for your team (detailed on https://wikit...
[16:24:21] (CR) ArielGlenn: "I'm missing some cotext I think; could you say something about the time_bucket change? Thanks."
[deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:24:50] ops-eqiad, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11820727 (RobH)
[16:28:26] SRE-Access-Requests, Patch-For-Review: Add Papaul FIDO backup SSH key - https://phabricator.wikimedia.org/T423293#11820743 (Aklapper) + #SRE-Access-Requests per https://wikitech.wikimedia.org/wiki/Yubikey-SSH-FIDO
[16:29:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90686 and previous config saved to /var/cache/conftool/dbconfig/20260414-162945-fceratto.json
[16:29:49] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:32:46] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1270962 (https://phabricator.wikimedia.org/T423293) (owner: Papaul)
[16:34:04] SRE, SRE-Unowned, Incident Severity 1, Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11820786 (MLechvien-WMF)
[16:34:05] !log daniel@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply
[16:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:47] SRE, SRE-Unowned, Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11820799 (MLechvien-WMF)
[16:34:48] !log daniel@deploy1003 helmfile
[aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[16:35:00] !log daniel@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: apply
[16:35:15] !log daniel@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: apply
[16:38:22] (CR) Ladsgroup: [C:+1] "Ack" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[16:38:25] (Abandoned) JHathaway: firewall: add cloud services [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[16:39:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P90687 and previous config saved to /var/cache/conftool/dbconfig/20260414-163953-fceratto.json
[16:42:03] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301#11820831 (Rmaung) Thanks @Aklapper, I'll add this to our docs now.
:)
[16:42:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:43:19] (Abandoned) JHathaway: mailman: send web posting through Spamassassin [puppet] - https://gerrit.wikimedia.org/r/1249415 (https://phabricator.wikimedia.org/T386559) (owner: JHathaway)
[16:43:23] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421398
[16:43:27] T421398: SystemdUnitFailed - zuul-executor - https://phabricator.wikimedia.org/T421398
[16:43:36] (Abandoned) JHathaway: firewall: add to role::wmcs::instance, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226371 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[16:43:49] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul2001.codfw.wmnet with reason: T421398
[16:44:05] (Abandoned) JHathaway: firewall: remove includes [puppet] - https://gerrit.wikimedia.org/r/1213590 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[16:44:31] (PS1) Andrew Bogott: cloud-vps vendordata: [puppet] - https://gerrit.wikimedia.org/r/1271000 (https://phabricator.wikimedia.org/T422509)
[16:44:54] (CR) Andrew Bogott: [C:+2] cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270975 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[16:45:30] (PS2) Andrew Bogott: cloud-vps vendordata: force puppet install during image creation [puppet] - https://gerrit.wikimedia.org/r/1271000 (https://phabricator.wikimedia.org/T422509)
[16:45:58] (Abandoned) JHathaway: backup1012: add to legacy slugs [cookbooks] - https://gerrit.wikimedia.org/r/1193229 (owner: JHathaway)
[16:46:19] (CR) Andrew Bogott: [C:+2] cloud-vps vendordata: force puppet install during image creation [puppet] -
https://gerrit.wikimedia.org/r/1271000 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[16:46:51] (Abandoned) JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - https://gerrit.wikimedia.org/r/1135115 (owner: JHathaway)
[16:50:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P90688 and previous config saved to /var/cache/conftool/dbconfig/20260414-165001-fceratto.json
[16:52:18] (CR) JHathaway: "@vgutierrez@wikimedia.org is this worth doing?" [puppet] - https://gerrit.wikimedia.org/r/1192934 (https://phabricator.wikimedia.org/T401858) (owner: JHathaway)
[16:58:06] (PS2) JHathaway: acme-chief: remove hiera purge guard [puppet] - https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858)
[16:58:07] (CR) Clément Goubert: [C:+1] envoy: Add 1 retry for swift services [puppet] - https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[16:58:25] (PS1) Catrope: Enforce 2FA requirements for phase 1 groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118)
[16:58:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118) (owner: Catrope)
[16:59:04] (CR) JHathaway: "@vgutierrez@wikimedia.org please review when you have a moment" [puppet] - https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858) (owner: JHathaway)
[16:59:59] SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11820884 (MatthewVernon) Same failure mode after the BIOS
upgrade - post-installer boot is fine, after puppet it gets to: ` Booting from Hard drive C: GRUB ` a... [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1700) [17:00:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90689 and previous config saved to /var/cache/conftool/dbconfig/20260414-170010-fceratto.json [17:00:14] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:00:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance [17:02:10] (03CR) 10Chlod Alejandro: Update wikimania wordmark for 2026 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: 10Robertsky) [17:03:13] (03PS1) 10Andrew Bogott: cloud-vps vendordata: typo fix in apt line [puppet] - 10https://gerrit.wikimedia.org/r/1271002 (https://phabricator.wikimedia.org/T422509) [17:03:56] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps vendordata: typo fix in apt line [puppet] - 10https://gerrit.wikimedia.org/r/1271002 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott) [17:07:37] !log updating caprica hostlists on cloud-hosts-in cr firewall policies [17:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2203.codfw.wmnet with reason: Maintenance [17:12:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90690 and previous config saved to /var/cache/conftool/dbconfig/20260414-171246-fceratto.json [17:12:50] T419635: Drop il_to column from imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T419635 [17:17:38] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_eqiad - 9.2.13 Upgrade () [17:17:43] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_eqiad - 9.2.13 Upgrade () [17:19:56] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11820959 (10TheDJ) [17:28:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90691 and previous config saved to /var/cache/conftool/dbconfig/20260414-172838-fceratto.json [17:28:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:30:01] (03PS1) 10Dzahn: jenkins: allow disabling jenkins even on the manager host [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) [17:38:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:38:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P90692 and previous config saved to /var/cache/conftool/dbconfig/20260414-173846-fceratto.json [17:39:08] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:39:10] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:41:10] (03PS1) 10Ladsgroup: src: Fix typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271018 [17:41:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:41:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:44:45] (03PS2) 10Ladsgroup: envoy: Add 1 retry for swift services [puppet] - 10https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) [17:44:51] (03CR) 10Ladsgroup: [V:03+2 C:03+2] envoy: Add 1 retry for swift services [puppet] - 10https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [17:44:59] (03PS1) 10Dzahn: lists: notify apache2 service when config changes [puppet] - 10https://gerrit.wikimedia.org/r/1271019 (https://phabricator.wikimedia.org/T323208) [17:45:23] jouncebot: nowandnext [17:45:23] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1700) [17:45:23] In 0 hour(s) and 14 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1800) [17:46:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:47:08] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:47:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:48:28] (03CR) 10RLazarus: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 
(https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [17:48:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P90693 and previous config saved to /var/cache/conftool/dbconfig/20260414-174854-fceratto.json [17:49:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:49:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:51:46] (03CR) 10Dzahn: "I suggest to keep this simple and just do this instead: https://gerrit.wikimedia.org/r/1271019" [puppet] - 10https://gerrit.wikimedia.org/r/1270921 (https://phabricator.wikimedia.org/T323208) (owner: 10Arnaudb) [17:51:49] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2068.codfw.wmnet with OS bullseye [17:51:56] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11821132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye execu... 
[17:54:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271018 (owner: 10Ladsgroup) [17:54:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:55:51] (03Merged) 10jenkins-bot: src: Fix typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271018 (owner: 10Ladsgroup) [17:56:02] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_eqiad - 9.2.13 Upgrade () [17:56:16] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1271018|src: Fix typos]] [17:58:05] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1271018|src: Fix typos]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:59:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90694 and previous config saved to /var/cache/conftool/dbconfig/20260414-175902-fceratto.json [17:59:04] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:59:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:59:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance [17:59:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T419635)', diff saved to https://phabricator.wikimedia.org/P90695 and previous config saved to /var/cache/conftool/dbconfig/20260414-175927-fceratto.json [18:00:05] dduvall and dancy: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. 
You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1800). [18:00:14] o/ [18:00:38] Amir1: Lemme know when you're clear [18:00:53] Almost there [18:00:55] sorry [18:02:44] No prob [18:03:29] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271018|src: Fix typos]] (duration: 07m 13s) [18:03:32] done [18:03:40] dancy: ^ sorry for the wait [18:04:29] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271021 (https://phabricator.wikimedia.org/T420482) [18:04:29] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_eqiad - 9.2.13 Upgrade () [18:04:31] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271021 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [18:05:23] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271021 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [18:08:26] (03PS1) 10JHathaway: jhathaway: remove non-fido ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1271025 [18:11:00] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.24 refs T420482 [18:11:04] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482 [18:12:01] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301#11821236 (10andrea.denisse) 05Open→03In progress a:03andrea.denisse [18:12:03] (03PS2) 10Dzahn: lists: notify apache2 service when config changes [puppet] - 10https://gerrit.wikimedia.org/r/1271019 (https://phabricator.wikimedia.org/T323208) [18:14:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T419635)', diff saved to https://phabricator.wikimedia.org/P90696 and previous config 
saved to /var/cache/conftool/dbconfig/20260414-181416-fceratto.json [18:14:20] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:15:26] PROBLEM - Host ms-be2068 is DOWN: PING CRITICAL - Packet loss = 100% [18:16:33] (03PS3) 10Jforrester: Make abstractwiki a multi-lingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) [18:17:02] (03CR) 10Jforrester: "Plan is to do this tomorrow, once Wikidata has wmf.24 with the new messages." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) (owner: 10Jforrester) [18:17:54] (03Abandoned) 10Jforrester: mc: Shift the Wikifunctions MC route from /local/wf/ to //wf-wan/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247687 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [18:22:16] (03CR) 10Mszwarc: [C:03+1] Enforce 2FA requirements for phase 1 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118) (owner: 10Catrope) [18:24:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P90697 and previous config saved to /var/cache/conftool/dbconfig/20260414-182424-fceratto.json [18:26:59] (03PS1) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [18:27:27] (03PS1) 10Andrew Bogott: cloud-vps vendordata: slightly more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1271029 (https://phabricator.wikimedia.org/T422509) [18:28:32] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps vendordata: slightly more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1271029 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott) [18:28:52] (03CR) 10CI reject: [V:04-1] 
fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:29:03] (03PS5) 10JHathaway: sysctls: add optional module param to sysctl::parameters [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) [18:29:55] (03PS2) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [18:30:03] (03CR) 10JHathaway: sysctls: add optional module param to sysctl::parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [18:31:50] (03CR) 10CI reject: [V:04-1] fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:32:47] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [18:34:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P90698 and previous config saved to /var/cache/conftool/dbconfig/20260414-183432-fceratto.json [18:35:27] (03CR) 10Papaul: [C:03+2] Add my FIDO backup key [puppet] - 10https://gerrit.wikimedia.org/r/1270962 (https://phabricator.wikimedia.org/T423293) (owner: 10Papaul) [18:36:22] (03PS3) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [18:36:43] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:40:20] (03CR) 10Eevans: [C:03+2] aqs1026: assign aqs role & configure [puppet] - 
10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:40:57] (03CR) 10CDanis: fundraising_data_import maintenance script wrapper & timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:41:01] (03PS10) 10Eevans: aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [18:41:01] (03PS10) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [18:41:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1271025 (owner: 10JHathaway) [18:41:46] (03CR) 10CDanis: "Proof-of-concept -- PTAL :)" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:42:10] (03CR) 10JHathaway: [C:03+2] jhathaway: remove non-fido ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1271025 (owner: 10JHathaway) [18:42:55] (03PS1) 10C. Scott Ananian: ParsoidLanguageConverter: convert inside [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961) [18:43:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961) (owner: 10C. 
Scott Ananian) [18:44:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T419635)', diff saved to https://phabricator.wikimedia.org/P90699 and previous config saved to /var/cache/conftool/dbconfig/20260414-184440-fceratto.json [18:44:44] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:48:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:52:04] (03CR) 10Eevans: [C:03+2] aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:54:18] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301#11821375 (10Rmaung) Adding my approval as Paris' supervisor if that is needed. Paris will need level 3 access with Kerberos. [19:05:46] (03PS1) 10Dzahn: integration: switch integration-agent-docker VMs to Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) [19:10:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: 10Bvibber) [19:11:16] (03CR) 10JHathaway: [C:03+2] reposync: don't enforce ownership after init [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway) [19:13:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:10] RECOVERY - OSPF status on 
cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:14:12] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:16:01] FYI, I'll be applying some pending external-services network policy diffs to wikikube clusters [19:16:35] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1026.eqiad.wmnet with reason: Bootstrapping — T412830 [19:16:39] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [19:19:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:19:21] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:19:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:20:11] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:21:07] FIRING: [2x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:25] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [19:22:04] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:23:39] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [19:24:27] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[19:26:07] FIRING: [3x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:27:30] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [19:27:58] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [19:28:02] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301#11821462 (10andrea.denisse) [19:30:38] !log applied external-services network policy updates for cassandra-analytics-query-service-storage-[ab]-eqiad (aqs1026) and dumps-wikimedia in wikikube clusters [19:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:09] (03CR) 10Dzahn: gerrit: update sync-instances cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [19:33:54] (03CR) 10Dzahn: "Since we do this change at runtime with a curl command we might not need this change at all - unless it's to be permanent." 
[puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [19:37:40] (03CR) 10Dzahn: [C:04-1] gerrit: migrate gerrit_site away from root partition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [19:38:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:43] (03PS1) 10Andrew Bogott: Revert "cloud-vps vendordata: slightly more cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1271034 [19:41:09] (03CR) 10Dzahn: "not possible to compile changes here? Hosts that were skipped (fail fast)" [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:41:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295) (owner: 10Sbisson) [19:41:57] (03CR) 10Andrew Bogott: [C:03+2] Revert "cloud-vps vendordata: slightly more cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1271034 (owner: 10Andrew Bogott) [19:49:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:49:12] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:49:33] (03CR) 10Eevans: [C:03+1] "@swfrench@wikimedia.org agreed!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: 10Scott French) [19:50:10] FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:50:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:53:26] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:54:06] (03PS2) 10Dzahn: jenkins: allow disabling jenkins even on the manager host [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) [19:55:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:55:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. 
Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T2000). [20:00:05] maryum, Robertsky, RoanKattouw, cscott, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:26] I'm planning to deploy with spiderpig [20:00:46] (03PS1) 10C. Scott Ananian: LanguageConverter: Allow disabling top-level variant "guess" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328) [20:00:50] o/ [20:00:56] i'm also going to spiderpig it [20:00:57] o/ [20:01:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328) (owner: 10C. Scott Ananian) [20:01:22] standing by for someone to deploy mine. :) [20:01:56] I can deploy my own patch and robertsky's and whoever else needs me to deploy theirs (but after cscott has gone) [20:02:02] o/ [20:02:18] i can spiderpig my config patch (or folks can bundle it with other config patches) [20:02:31] maryum why don't you get started? config patch should be fast. [20:02:38] yes doing it now [20:02:46] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1271017/8416/contint1002.wikimedia.org/change.contint1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [20:03:04] then i guess RoanKattouw is suggesting i should go next, and then he'll do his own patch and robertsky 's. 
[20:03:08] (03PS3) 10Dzahn: jenkins: allow disabling jenkins even on the manager host [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) [20:04:33] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301#11821545 (10andrea.denisse) [20:05:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:05:37] spiderpig is in progress [20:06:34] can i just say again, as someone who's been intimately or tangentially involved with mediawiki deploys for 24 years, that spiderpig is *such* a wonderful democratizing tool <3 [20:06:47] +100, SpiderPig is amazing [20:06:49] (03Merged) 10jenkins-bot: Route email confirmation funnel through Test Kitchen experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:06:57] <3 the pig [20:07:18] !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1270571|Route email confirmation funnel through Test Kitchen experiment (T420007)]] [20:07:22] T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007 [20:07:22] mm now i wanna get out my studio ghibli collection and watch porco rosso again [20:09:08] !log mstyles@deploy1003 mstyles: Backport for [[gerrit:1270571|Route email confirmation funnel through Test Kitchen experiment (T420007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:12:52] !log mstyles@deploy1003 mstyles: Continuing with sync [20:16:29] (03PS1) 10Dzahn: gerrit: allow zuul machines to port 22 ssh (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1271042 [20:16:43] !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270571|Route email confirmation funnel through Test Kitchen experiment (T420007)]] (duration: 09m 25s) [20:16:47] T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007 [20:17:07] (03CR) 10CI reject: [V:04-1] gerrit: allow zuul machines to port 22 ssh (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (owner: 10Dzahn) [20:17:16] cscott I think you can go now [20:17:39] cool, thanks! [20:17:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961) (owner: 10C. Scott Ananian) [20:17:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328) (owner: 10C. Scott Ananian) [20:21:50] (03CR) 10Dzahn: [V:03+1] "it does what it is intended to do - the parameter naming could be confusing though - https://puppet-compiler.wmflabs.org/output/1271017/84" [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [20:22:30] (03Merged) 10jenkins-bot: ParsoidLanguageConverter: convert inside [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961) (owner: 10C. Scott Ananian) [20:29:48] (03Merged) 10jenkins-bot: LanguageConverter: Allow disabling top-level variant "guess" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328) (owner: 10C. 
Scott Ananian) [20:30:12] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]] [20:30:18] T422961: LanguageConverter doesn't convert inside - https://phabricator.wikimedia.org/T422961 [20:30:18] T419328: Legacy LanguageConverter uses top-level ::guessVariant on srwiki - https://phabricator.wikimedia.org/T419328 [20:32:00] !log cscott@deploy1003 cscott: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:32:32] (03PS4) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [20:36:42] !log cscott@deploy1003 cscott: Continuing with sync [20:40:31] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]] (duration: 10m 18s) [20:40:35] T422961: LanguageConverter doesn't convert inside - https://phabricator.wikimedia.org/T422961 [20:40:36] T419328: Legacy LanguageConverter uses top-level ::guessVariant on srwiki - https://phabricator.wikimedia.org/T419328 [20:40:40] ok, over to you RoanKattouw [20:42:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: 10Robertsky) [20:42:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271001 
(https://phabricator.wikimedia.org/T423118) (owner: Catrope) [20:44:27] (CR) CI reject: [V:-1] Update wikimaniawiki namespace search [mediawiki-config] - https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: Robertsky) [20:45:28] (CR) Catrope: [C:+2] Update wikimaniawiki namespace search [mediawiki-config] - https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: Robertsky) [20:46:05] it's always fun when castor-save-workspace-cache fails. [20:46:32] (Merged) jenkins-bot: Enforce 2FA requirements for phase 1 groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118) (owner: Catrope) [20:47:26] that feels like either T419488 or T409479 (depending on whether one's a duplicate of the other), and/or maybe some other task I'm not aware of [20:47:26] T419488: PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488 [20:47:27] T409479: quibble-with-gated-extensions-vendor-mysql-php81 failure due to postbuild jobs - https://phabricator.wikimedia.org/T409479 [20:47:36] (Merged) jenkins-bot: Update wikimaniawiki namespace search [mediawiki-config] - https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: Robertsky) [20:49:30] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1270933|Update wikimaniawiki namespace search (T423278)]], [[gerrit:1271001|Enforce 2FA requirements for phase 1 groups (T423118)]] [20:49:35] T423278: update search namespace to 2026, 2027 for wikimaniawiki - https://phabricator.wikimedia.org/T423278 [20:49:36] T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118 [20:51:22] !log catrope@deploy1003 catrope, robertsky: Backport for [[gerrit:1270933|Update wikimaniawiki namespace search (T423278)]], 
[[gerrit:1271001|Enforce 2FA requirements for phase 1 groups (T423118)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:53:09] !log catrope@deploy1003 catrope, robertsky: Continuing with sync [20:53:10] verified. [20:56:58] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270933|Update wikimaniawiki namespace search (T423278)]], [[gerrit:1271001|Enforce 2FA requirements for phase 1 groups (T423118)]] (duration: 07m 28s) [20:57:03] T423278: update search namespace to 2026, 2027 for wikimaniawiki - https://phabricator.wikimedia.org/T423278 [20:57:03] T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118 [20:57:10] Alright I'm done [20:57:21] bvibber: You're free to do yours now [20:58:45] thank you for the assistance! [20:59:46] whee [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T2100) [21:01:08] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: Bvibber) [21:01:16] it begins [21:01:25] shouldn't take long, it's just config <3 [21:02:28] (Merged) jenkins-bot: Enable ReaderExperiments for itwiki, plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: Bvibber) [21:02:51] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1270592|Enable ReaderExperiments for itwiki, plwiki (T423173)]] [21:02:55] T423173: Mobile Page Previews: Launch the experiment - https://phabricator.wikimedia.org/T423173 [21:04:41] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1270592|Enable ReaderExperiments for itwiki, plwiki (T423173)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [21:08:20] something's not right [21:08:48] !log bvibber@deploy1003 bvibber: Continuing with sync [21:08:53] nope it's right! [21:08:57] i was testing the wrong thing lol [21:12:39] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270592|Enable ReaderExperiments for itwiki, plwiki (T423173)]] (duration: 09m 48s) [21:12:43] T423173: Mobile Page Previews: Launch the experiment - https://phabricator.wikimedia.org/T423173 [21:14:06] PROBLEM - SSH on wikikube-worker2280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:14:18] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T423301#11821904 (andrea.denisse) [21:16:13] oh yay mine's all done :D [21:16:20] anybody left or are we on to the next window :D [21:18:03] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T423301#11821907 (andrea.denisse) [21:18:33] FIRING: KubernetesCalicoDown: wikikube-worker2280.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2280.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:22:14] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T423301#11821928 (andrea.denisse) Hi @Passimacopoulos, while I work on your request I need to share with you the [[ https://wikitech.wikimedia.org/w... 
[21:24:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2280:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2280 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:39:18] PROBLEM - Host wikikube-worker2280 is DOWN: PING CRITICAL - Packet loss = 100% [21:39:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2280:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2280 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:41:48] (CR) Jasmine: [C:+2] Add wikikube-ctrl2004 and wikikube-ctrl2005 to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1218351 (https://phabricator.wikimedia.org/T390861) (owner: Jasmine) [22:43:16] !log jasmine@dns1004 START - running authdns-update [22:44:44] !log jasmine@dns1004 END - running authdns-update [23:09:03] (Abandoned) Andrew Bogott: nova vendordata: disable unattended upgrades in base image [puppet] - https://gerrit.wikimedia.org/r/1269056 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott) [23:11:28] !log optimizing globalblocks table on s7 (T423349) [23:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:31] T423349: globalblocks query randomly becomes slow - https://phabricator.wikimedia.org/T423349 [23:14:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 13.76% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:16:07] FIRING: [4x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:19:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle 
PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 13.2% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:36:01] (PS1) Eevans: linked-artifacts: update staging to v1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1271083 (https://phabricator.wikimedia.org/T414838) [23:36:03] (PS2) Cwhite: beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516) [23:39:13] (PS1) Ladsgroup: Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) [23:39:26] (PS1) Ladsgroup: Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1271086 (https://phabricator.wikimedia.org/T412637) [23:39:35] (PS3) Cwhite: beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516) [23:39:43] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup) [23:39:44] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1271086 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup) [23:39:45] (PS4) Cwhite: beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516) [23:39:53] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1271088 [23:39:59] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
https://gerrit.wikimedia.org/r/1271088 (owner: TrainBranchBot) [23:41:07] FIRING: [3x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:45:09] (CR) Cwhite: [C:+2] beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite) [23:54:11] (Merged) jenkins-bot: Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1271086 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup) [23:55:25] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:55:27] (CR) CI reject: [V:-1] Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup) [23:55:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:57:08] (CR) Ladsgroup: "try again" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup) [23:57:30] (CR) Ladsgroup: [C:+2] Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup) [23:58:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1271088 
(owner: TrainBranchBot)