[00:00:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:00:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:01:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:01:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:05:23] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: sync
[00:08:48] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: sync
[00:09:40] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool pool db1227: Work done
[00:20:40] <jinxer-wm>	 RESOLVED: ProbeDown: Service aqs1025-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1025-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:28:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:22] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2208: Work done
[00:51:30] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1025.eqiad.wmnet
[00:51:31] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1025.eqiad.wmnet
[00:57:34] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1227: Work done
[01:09:06] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.24 [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270604 (https://phabricator.wikimedia.org/T420482)
[01:09:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.24 [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270604 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[01:09:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270605
[01:09:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270605 (owner: TrainBranchBot)
[01:19:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.24 [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270604 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[01:19:56] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270605 (owner: TrainBranchBot)
[01:50:59] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0200)
[02:09:16] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:16] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:59] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0300)
[03:01:53] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270614 (https://phabricator.wikimedia.org/T420482)
[03:01:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270614 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[03:02:53] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.46.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270614 (https://phabricator.wikimedia.org/T420482) (owner: TrainBranchBot)
[03:03:16] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.46.0-wmf.24  refs T420482
[03:03:20] <stashbot>	 T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482
[03:39:00] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.46.0-wmf.24  refs T420482 (duration: 35m 44s)
[03:39:04] <stashbot>	 T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482
[03:56:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[03:56:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[03:58:09] <jelto>	 !incidents
[03:58:09] <sirenbot>	 7833 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[03:58:27] <jelto>	 Let's see 
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0400)
[04:00:37] <godog>	 checking too
[04:02:04] <godog>	 jelto: codfw/ulsfo correct?
[04:02:37] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.46.0-wmf.21 (duration: 02m 34s)
[04:04:03] <godog>	 looking into turnilo
[04:04:26] <jelto>	 godog: yes the transport to cr4-ulsfo 
[04:04:37] <jelto>	 I'm also looking in superset
[04:04:51] <wikibugs>	 (PS1) Pppery: Update translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1270617
[04:04:56] <godog>	 ok
[04:26:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[04:26:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[04:27:37] <jelto>	 !incidents
[04:27:37] <sirenbot>	 7833 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[04:28:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:51:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[04:51:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[04:52:05] <jelto>	 !incidents
[04:52:05] <sirenbot>	 7834 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[04:52:05] <sirenbot>	 7833 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[05:01:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[05:01:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[05:07:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:09:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:12:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:36] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: wiki-replicas: remove redundant grants [puppet] - https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806) (owner: FNegri)
[05:20:02] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: wiki-replicas: add grants for %_maintain [puppet] - https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806) (owner: FNegri)
[05:21:49] <wikibugs>	 (PS1) Marostegui: installserver: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270758 (https://phabricator.wikimedia.org/T423151)
[05:25:30] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270758 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[05:27:09] <wikibugs>	 (PS1) Marostegui: db2217: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270760 (https://phabricator.wikimedia.org/T422777)
[05:27:26] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2217.codfw.wmnet with reason: Reimage to Trixie
[05:27:48] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2217: Reimage
[05:28:06] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2217: Reimage
[05:29:57] <wikibugs>	 (CR) Marostegui: [C:+2] db2217: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270760 (https://phabricator.wikimedia.org/T422777) (owner: Marostegui)
[05:30:18] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2217.codfw.wmnet with OS trixie
[05:35:24] <wikibugs>	 (PS1) Marostegui: db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270761 (https://phabricator.wikimedia.org/T422777)
[05:35:34] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1180: Upgrade package
[05:35:49] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1180.eqiad.wmnet with reason: Reimage to Trixie
[05:36:06] <wikibugs>	 (CR) Marostegui: [C:+2] db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270761 (https://phabricator.wikimedia.org/T422777) (owner: Marostegui)
[05:39:04] <wikibugs>	 (PS1) Marostegui: eqiad.yaml: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151)
[05:40:44] <wikibugs>	 (PS1) Daniel Kinzler: API rate limits: add highlimits-user class [mediawiki-config] - https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796)
[05:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:41:34] <wikibugs>	 (CR) CI reject: [V:-1] API rate limits: add highlimits-user class [mediawiki-config] - https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796) (owner: Daniel Kinzler)
[05:43:19] <wikibugs>	 (PS2) Daniel Kinzler: API rate limits: add highlimits-user class [mediawiki-config] - https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796)
[05:46:04] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db1180: Upgrade package
[05:47:15] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1180.eqiad.wmnet with OS trixie
[05:49:07] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2217.codfw.wmnet with reason: host reimage
[05:52:47] <wikibugs>	 (PS1) Daniel Kinzler: rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - https://gerrit.wikimedia.org/r/1270766
[05:54:19] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2217.codfw.wmnet with reason: host reimage
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0600).
[06:00:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Temporarily depool puppetserver1002/2002 [dns] - https://gerrit.wikimedia.org/r/1270441 (owner: Muehlenhoff)
[06:00:57] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[06:02:13] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[06:02:40] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: host reimage
[06:03:53] <wikibugs>	 (PS1) Daniel Kinzler: rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270767
[06:04:55] <wikibugs>	 (CR) Marostegui: "@taavi@wikimedia.org @fnegri@wikimedia.org is there anything needed after merging this to get the host removed from the LB?" [puppet] - https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[06:08:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: host reimage
[06:09:49] <wikibugs>	 (PS1) Marostegui: Revert "db2217: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270768
[06:11:10] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2217.codfw.wmnet with OS trixie
[06:11:45] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2217: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270768 (owner: Marostegui)
[06:12:13] <wikibugs>	 (PS1) Marostegui: Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270769
[06:13:09] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2217: repool after reimage to trixie
[06:16:12] <wikibugs>	 (PS5) Ryan Kemper: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:19:14] <wikibugs>	 (PS6) Ryan Kemper: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:20:03] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:20:57] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[06:22:12] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[06:25:42] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[06:27:03] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[06:30:07] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[06:30:35] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1180.eqiad.wmnet with OS trixie
[06:30:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[06:31:11] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] P:wmcs::striker: Remove separate monitoring profile [puppet] - https://gerrit.wikimedia.org/r/1270282 (owner: Majavah)
[06:33:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2002.codfw.wmnet
[06:36:34] <wikibugs>	 ops-eqiad, DC-Ops: Work on storage room cleanup - https://phabricator.wikimedia.org/T423227 (VRiley-WMF) NEW
[06:38:31] <wikibugs>	 (CR) JMeybohm: [C:+1] "SGTM, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[06:39:27] <wikibugs>	 (CR) JMeybohm: [C:+2] prometheus::k8s: Ingest envoy cluster_update metrics [puppet] - https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: JMeybohm)
[06:40:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2002.codfw.wmnet
[06:40:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1002.eqiad.wmnet
[06:41:43] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "PCC looking good. I fixed a few issues:" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[06:43:10] <wikibugs>	 (PS3) JMeybohm: kubernetes: Remove docker as supported container runtime [puppet] - https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870)
[06:45:13] <wikibugs>	 SRE, Data-Platform-SRE, Infrastructure-Foundations, Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11817793 (JMeybohm)
[06:45:35] <wikibugs>	 SRE, Data-Platform-SRE, Infrastructure-Foundations, Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11817794 (JMeybohm)
[06:47:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1002.eqiad.wmnet
[06:50:39] <wikibugs>	 (PS1) Muehlenhoff: Revert "Temporarily depool puppetserver1002/2002" [dns] - https://gerrit.wikimedia.org/r/1270770
[06:56:31] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "Temporarily depool puppetserver1002/2002" [dns] - https://gerrit.wikimedia.org/r/1270770 (owner: Muehlenhoff)
[06:56:37] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[06:57:59] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[06:58:34] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2217: repool after reimage to trixie
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add a new Cumin alias to match hosts which are accessible via kerberized SSH [puppet] - https://gerrit.wikimedia.org/r/1270279 (owner: Muehlenhoff)
[07:04:59] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270769 (owner: Marostegui)
[07:05:35] <wikibugs>	 (PS1) Mszwarc: Prepare $wgOATH2FARequiredGroupRemovalPages for next groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118)
[07:06:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118) (owner: Mszwarc)
[07:06:25] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1180: after upgrade
[07:07:12] <Msz2001>	 I see no deploys are going on, I'll proceed with deploying that change ^
[07:08:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118) (owner: Mszwarc)
[07:08:32] <wikibugs>	 (PS1) Marostegui: db2180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270773
[07:08:57] <wikibugs>	 (Merged) jenkins-bot: Prepare $wgOATH2FARequiredGroupRemovalPages for next groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1270772 (https://phabricator.wikimedia.org/T423118) (owner: Mszwarc)
[07:09:01] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:09:36] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:10:01] <logmsgbot>	 !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1270772|Prepare $wgOATH2FARequiredGroupRemovalPages for next groups (T423118)]]
[07:10:05] <stashbot>	 T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[07:11:29] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:11:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Reimage to Trixie
[07:11:48] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2180: Reimage to Trixie
[07:11:55] <wikibugs>	 (CR) Marostegui: [C:+2] db2180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270773 (owner: Marostegui)
[07:12:07] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2180: Reimage to Trixie
[07:14:02] <wikibugs>	 (PS2) Brouberol: deployment_server: expand IPs behind the dumps-wikimedia external service [puppet] - https://gerrit.wikimedia.org/r/1270771
[07:14:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2180.codfw.wmnet with OS trixie
[07:15:40] <logmsgbot>	 !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1270772|Prepare $wgOATH2FARequiredGroupRemovalPages for next groups (T423118)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:15:44] <stashbot>	 T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[07:16:29] <logmsgbot>	 !log mszwarc@deploy1003 mszwarc: Continuing with sync
[07:22:38] <logmsgbot>	 !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270772|Prepare $wgOATH2FARequiredGroupRemovalPages for next groups (T423118)]] (duration: 12m 36s)
[07:22:41] <stashbot>	 T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[07:29:24] <wikibugs>	 SRE-SLO, ServiceOps new, Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11817924 (OKryva-WMF)
[07:32:16] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2180.codfw.wmnet with reason: host reimage
[07:32:19] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: Migrate section-specific DBA access rule to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: Muehlenhoff)
[07:35:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[07:36:11] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11817941 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye
[07:36:27] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2068
[07:36:29] <wikibugs>	 (PS1) Klausman: admin/klausman: remove non-YK SSH key [puppet] - https://gerrit.wikimedia.org/r/1270780
[07:36:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[07:38:17] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: host reimage
[07:40:26] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2068 - mvernon@cumin2002"
[07:40:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2068 - mvernon@cumin2002"
[07:40:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:40:32] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2068.codfw.wmnet 91.32.192.10.in-addr.arpa 1.9.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:40:36] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2068.codfw.wmnet 91.32.192.10.in-addr.arpa 1.9.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:40:37] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2068
[07:41:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2068
[07:41:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2068
[07:46:59] <jinxer-wm>	 FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
[07:49:36] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on pc2012.codfw.wmnet with reason: T419961
[07:49:57] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on pc1012.eqiad.wmnet with reason: T419961
[07:51:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1180: after upgrade
[07:54:20] <wikibugs>	 (PS1) Marostegui: Revert "db2180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270813
[07:54:59] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2180: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270813 (owner: Marostegui)
[07:58:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[07:59:55] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11818041 (Marostegui)
[08:00:05] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2180.codfw.wmnet with OS trixie
[08:00:22] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2180: repool after maintenance
[08:01:58] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc1012: T419961
[08:01:58] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:02:11] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:02:11] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1012: T419961
[08:02:26] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc1012: T419961
[08:02:26] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:02:34] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:02:34] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1012: T419961
[08:02:44] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc2012: T419961
[08:02:44] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:02:50] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:02:50] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2012: T419961
[08:04:20] <moritzm>	 !log installing libnginx-mod-http-lua security updates
[08:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:34] <wikibugs>	 (PS1) Marostegui: db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270859
[08:05:10] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[08:05:18] <wikibugs>	 (CR) Marostegui: [C:+2] db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270859 (owner: Marostegui)
[08:05:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1168.eqiad.wmnet with reason: Reimage to Trixie
[08:05:45] <wikibugs>	 (PS1) MVernon: preseed: move ms-be206[8,9] to new-style storage [puppet] - https://gerrit.wikimedia.org/r/1270860 (https://phabricator.wikimedia.org/T354872)
[08:05:48] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1168: Reimage to Trixie
[08:06:06] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1168: Reimage to Trixie
[08:07:42] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1168.eqiad.wmnet with OS trixie
[08:10:34] <wikibugs>	 (CR) Marostegui: [C:+1] preseed: move ms-be206[8,9] to new-style storage [puppet] - https://gerrit.wikimedia.org/r/1270860 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[08:14:01] <wikibugs>	 (CR) MVernon: [C:+2] preseed: move ms-be206[8,9] to new-style storage [puppet] - https://gerrit.wikimedia.org/r/1270860 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[08:17:30] <wikibugs>	 (CR) Ayounsi: [C:-1] kubernetes-generic: Add alerts for BGP failure scenarios. (4 comments) [alerts] - https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: Blake)
[08:20:43] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2012: T419961
[08:20:44] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[08:20:57] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:20:57] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2012: T419961
[08:22:10] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1168.eqiad.wmnet with reason: host reimage
[08:23:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2068.codfw.wmnet with OS bullseye
[08:23:17] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11818181 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet...
[08:25:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2068
[08:25:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[08:25:43] <wikibugs>	 (CR) Brouberol: [C:+1] "Excellent commit message. thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696) (owner: Ryan Kemper)
[08:28:31] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:28:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:51] <wikibugs>	 (CR) Brouberol: [C:-1] "The bug was in the original module, and was fixed. The real fix here is to update the module" [deployment-charts] - https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) (owner: Ryan Kemper)
[08:31:55] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1168.eqiad.wmnet with reason: host reimage
[08:34:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet
[08:34:29] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet
[08:35:06] <wikibugs>	 (PS3) Arnaudb: gerrit: migrate gerrit_site away from root partition [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027)
[08:35:06] <wikibugs>	 (CR) Arnaudb: "the pcc output is visible here: https://puppet-compiler.wmflabs.org/output/1270774/6384/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[08:35:28] <wikibugs>	 (PS2) Arnaudb: gerrit: update sync-instances cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143)
[08:35:28] <wikibugs>	 (CR) Arnaudb: "related to 1270774" [cookbooks] - https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: Arnaudb)
[08:37:46] <wikibugs>	 (CR) Blake: [C:+2] kubernetes: Remove docker as supported container runtime [puppet] - https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870) (owner: JMeybohm)
[08:38:44] <wikibugs>	 (CR) Blake: [C:+2] kubernetes: Remove docker related hiera settings from nodes [puppet] - https://gerrit.wikimedia.org/r/1260742 (https://phabricator.wikimedia.org/T395870) (owner: JMeybohm)
[08:40:20] <wikibugs>	 (CR) Blake: [C:+2] admin: add Blake's backup SSH key. [puppet] - https://gerrit.wikimedia.org/r/1270436 (owner: Blake)
[08:40:45] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177#11818235 (MLechvien-WMF) p:Triage→Low
[08:42:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175#11818240 (MLechvien-WMF) p:Triage→Low
[08:43:47] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[08:43:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90618 and previous config saved to /var/cache/conftool/dbconfig/20260414-084353-fceratto.json
[08:43:57] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:45:46] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2180: repool after maintenance
[08:52:12] <wikibugs>	 (PS1) Marostegui: Revert "db1168: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270866
[08:52:51] <wikibugs>	 (CR) Elukey: [C:+2] profile::pki::intermediates: update debmonitor's public key [puppet] - https://gerrit.wikimedia.org/r/1270286 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:59:24] <wikibugs>	 (CR) FNegri: [C:+1] "I think this patch is all we need." [puppet] - https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[09:01:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90620 and previous config saved to /var/cache/conftool/dbconfig/20260414-090112-fceratto.json
[09:01:18] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:02:15] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good. One tiny nit in the comment." [puppet] - https://gerrit.wikimedia.org/r/1270771 (owner: Brouberol)
[09:03:20] <wikibugs>	 (CR) Brouberol: deployment_server: expand IPs behind the dumps-wikimedia external service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1270771 (owner: Brouberol)
[09:09:27] <wikibugs>	 (CR) Marostegui: [C:+2] eqiad.yaml: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270764 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[09:09:29] <jinxer-wm>	 FIRING: ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:09:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:11:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90621 and previous config saved to /var/cache/conftool/dbconfig/20260414-091122-fceratto.json
[09:12:38] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup[1006-1007,1014].eqiad.wmnet with reason: maintenance
[09:12:44] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11818387 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b67b68ad-79cc-40ba-b2d3-11ce2438694e) set by j...
[09:13:06] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:14:29] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:15:04] <wikibugs>	 (PS11) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[09:15:47] <wikibugs>	 (CR) CI reject: [V:-1] P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[09:16:19] <wikibugs>	 (CR) Federico Ceratto: [C:+2] admin: Add second U2F key, remove non-U2F SSH key [puppet] - https://gerrit.wikimedia.org/r/1269970 (owner: Federico Ceratto)
[09:17:49] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host backup1006
[09:19:24] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[09:21:31] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90623 and previous config saved to /var/cache/conftool/dbconfig/20260414-092130-fceratto.json
[09:22:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:23:33] <wikibugs>	 (PS12) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[09:24:09] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1006 - ayounsi@cumin1003"
[09:24:14] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1006 - ayounsi@cumin1003"
[09:24:14] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:24:14] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache backup1006.eqiad.wmnet 162.32.64.10.in-addr.arpa 2.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:24:18] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) backup1006.eqiad.wmnet 162.32.64.10.in-addr.arpa 2.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:24:18] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host backup1006
[09:24:38] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1006
[09:24:38] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host backup1006
[09:24:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:24:45] <wikibugs>	 (CR) CI reject: [V:-1] P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[09:25:49] <wikibugs>	 (PS3) Brouberol: deployment_server: expand IPs behind the dumps-wikimedia external service [puppet] - https://gerrit.wikimedia.org/r/1270771
[09:27:48] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1151: Security updates
[09:27:49] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[09:27:51] <jinxer-wm>	 RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:27:57] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:27:57] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1151: Security updates
[09:28:23] <wikibugs>	 (CR) Brouberol: [C:+2] deployment_server: expand IPs behind the dumps-wikimedia external service [puppet] - https://gerrit.wikimedia.org/r/1270771 (owner: Brouberol)
[09:29:23] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db2144: Test depool
[09:29:23] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[09:29:29] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:29:29] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2144: Test depool
[09:30:55] <wikibugs>	 (PS1) Marostegui: check_private_data_report: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270868 (https://phabricator.wikimedia.org/T423151)
[09:31:39] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90625 and previous config saved to /var/cache/conftool/dbconfig/20260414-093138-fceratto.json
[09:31:42] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:31:50] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc2011: Test depool
[09:31:50] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[09:31:56] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[09:31:58] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[09:31:58] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2011: Test depool
[09:32:05] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90627 and previous config saved to /var/cache/conftool/dbconfig/20260414-093204-fceratto.json
[09:32:15] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: Test depool
[09:32:15] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[09:32:19] <wikibugs>	 (CR) Marostegui: [C:+2] check_private_data_report: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1270868 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[09:32:29] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[09:32:29] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: Test depool
[09:34:14] <icinga-wm>	 PROBLEM - MariaDB Replica IO: ms2 on db2144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1151.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1151.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:35:03] <wikibugs>	 (PS13) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[09:35:18] <marostegui>	 federico3: ^* 
[09:35:50] <federico3>	 odd, the script should have silenced it
[09:37:03] <wikibugs>	 (CR) CI reject: [V:-1] P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[09:38:32] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1151.eqiad.wmnet with reason: T419961
[09:39:29] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:11] <wikibugs>	 (PS1) Elukey: debmonitor: use chained TLS cert for server and client [puppet] - https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993)
[09:41:20] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: ms2 on db2144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:41:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:43:25] <wikibugs>	 (CR) Elukey: "pcc: https://puppet-compiler.wmflabs.org/output/1270870/8408/" [puppet] - https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[09:43:35] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[09:44:02] <wikibugs>	 (CR) Volans: [C:+1] "ship it" [puppet] - https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[09:44:11] <wikibugs>	 (CR) Elukey: [C:+2] debmonitor: use chained TLS cert for server and client [puppet] - https://gerrit.wikimedia.org/r/1270870 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[09:44:23] <wikibugs>	 (PS2) Federico Ceratto: sre.mysql.pool: Handle private tasks exception [cookbooks] - https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460)
[09:45:34] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2144.codfw.wmnet with reason: T419961
[09:46:03] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: Test depool
[09:46:03] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[09:46:13] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:46:13] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: Test depool
[09:46:29] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc2011: Test depool
[09:46:29] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[09:46:37] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:46:37] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2011: Test depool
[09:46:50] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: Test depool
[09:46:50] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[09:47:04] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:47:04] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: Test depool
[09:47:41] <wikibugs>	 (CR) Federico Ceratto: "I updated it to support parsercache as well and tested that." [cookbooks] - https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: Federico Ceratto)
[09:49:27] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90631 and previous config saved to /var/cache/conftool/dbconfig/20260414-094926-fceratto.json
[09:49:29] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:49:31] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:50:40] <wikibugs>	 (CR) Clément Goubert: [C:+1] rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - https://gerrit.wikimedia.org/r/1270766 (owner: Daniel Kinzler)
[09:50:49] <wikibugs>	 (CR) Marostegui: "So this is all tested and works?" [cookbooks] - https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: Federico Ceratto)
[09:52:35] <wikibugs>	 (CR) Clément Goubert: [C:+1] rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270767 (owner: Daniel Kinzler)
[09:56:54] <elukey>	 !log rotated debmonitor client and server certs fleetwide for intermediate certs rotation - T420993
[09:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:58] <stashbot>	 T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993
[09:57:59] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270767 (owner: Daniel Kinzler)
[09:58:15] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - https://gerrit.wikimedia.org/r/1270766 (owner: Daniel Kinzler)
[09:58:23] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest gateway: handle percent-escaped pipes in query params [deployment-charts] - https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler)
[09:59:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90632 and previous config saved to /var/cache/conftool/dbconfig/20260414-095934-fceratto.json
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1000)
[10:00:39] <wikibugs>	 (Merged) jenkins-bot: rest gateway: handle percent-escaped pipes in query params [deployment-charts] - https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler)
[10:00:45] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: add some more IPs of large-scale NATs [deployment-charts] - https://gerrit.wikimedia.org/r/1270766 (owner: Daniel Kinzler)
[10:00:47] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: try per-minute limits in shadow-mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270767 (owner: Daniel Kinzler)
[10:02:58] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:03:20] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:04:03] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:04:36] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:06:16] <wikibugs>	 (PS14) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[10:06:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1168.eqiad.wmnet with OS trixie
[10:07:19] <wikibugs>	 (PS1) STran: Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042)
[10:07:26] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1168: after reimage to trixie
[10:08:28] <wikibugs>	 (CR) CI reject: [V:-1] P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[10:08:47] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:09:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:09:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90634 and previous config saved to /var/cache/conftool/dbconfig/20260414-100942-fceratto.json
[10:09:46] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[10:10:36] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1168: after reimage to trixie
[10:10:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[10:11:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Fully repool db1168', diff saved to https://phabricator.wikimedia.org/P90635 and previous config saved to /var/cache/conftool/dbconfig/20260414-101119-marostegui.json
[10:12:46] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1168: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270866 (owner: Marostegui)
[10:12:49] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[10:13:36] <wikibugs>	 (CR) CI reject: [V:-1] Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[10:13:42] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet
[10:14:06] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:14:29] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Pool in', diff saved to https://phabricator.wikimedia.org/P90636 and previous config saved to /var/cache/conftool/dbconfig/20260414-101428-fceratto.json
[10:16:07] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:18:32] <wikibugs>	 (PS1) Elukey: admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993)
[10:19:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
[10:20:49] <wikibugs>	 (CR) Anzx: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[10:21:00] <volans>	 !log install cumin v6.0.0 on cumin1003 (last host remained to upgrade)
[10:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:17] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:23:48] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:24:27] <wikibugs>	 (PS2) STran: Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042)
[10:24:52] <logmsgbot>	 !log volans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1005.eqiad.wmnet with reason: Testing cumin v6.0.0
[10:25:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
[10:34:28] <wikibugs>	 (CR) Mszwarc: [C:+1] Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) (owner: STran)
[10:34:37] <wikibugs>	 (PS1) D3r1ck01: Remove temporary `wgOAuth2UsePrefixedSub` feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690)
[10:36:32] <wikibugs>	 (PS4) Arnaudb: gerrit: migrate gerrit_site away from root partition [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027)
[10:36:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:37:27] <wikibugs>	 (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[10:39:56] <wikibugs>	 (PS15) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[10:41:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:44:26] <wikibugs>	 (CR) Brouberol: [C:+1] "Diff looks good. I trust you on whether this is enough" [deployment-charts] - https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[10:44:53] <wikibugs>	 (PS1) Muehlenhoff: debdeploy: Bump changelog for new release [debs/debdeploy] - https://gerrit.wikimedia.org/r/1270883
[10:46:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:47:46] <wikibugs>	 (PS4) Bodhisattwa: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567
[10:49:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: Stang)
[10:51:01] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, I also like reducing the number of rsync server modules and bacula filesets" [puppet] - https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: Arnaudb)
[10:51:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:53:29] <wikibugs>	 (PS1) STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042)
[10:53:41] <wikibugs>	 (PS1) STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042)
[10:54:12] <wikibugs>	 (CR) STran: "We want to deploy to enwiki on Wed so this needs to go in at the same time" [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: STran)
[10:54:13] <wikibugs>	 (PS5) Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804)
[10:54:22] <icinga-wm>	 RECOVERY - MariaDB Replica IO: ms2 on db2144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:54:22] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: ms2 on db2144 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:54:40] <wikibugs>	 (CR) CI reject: [V:-1] Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: STran)
[10:54:41] <wikibugs>	 (PS4) Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804)
[10:54:52] <wikibugs>	 (PS16) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[10:54:55] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1151: Security update
[10:54:55] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[10:54:57] <wikibugs>	 (CR) JMeybohm: "Wouldn't it be enough to do this on one staging cluster rather then all of them?" [deployment-charts] - https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[10:55:04] <wikibugs>	 (CR) CI reject: [V:-1] Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: STran)
[10:55:05] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[10:55:05] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1151: Security update
[10:55:13] <wikibugs>	 (CR) JMeybohm: [C:-1] admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[10:56:22] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1153: Security updates
[10:56:22] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[10:56:24] <wikibugs>	 (CR) Clément Goubert: "@tklausmann@wikimedia.org @kbazira@wikimedia.org I'm tagging you on this as owners of the backend services, could you check that the URL p" [deployment-charts] - https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[10:56:30] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[10:56:30] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1153: Security updates
[10:56:51] <wikibugs>	 (CR) Clément Goubert: "@tklausmann@wikimedia.org @kbazira@wikimedia.org I'm tagging you on this as owners of the backend services, could you check that the URL p" [deployment-charts] - https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[10:57:08] <wikibugs>	 (CR) CI reject: [V:-1] P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[10:57:41] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[10:57:44] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "LGTM, no uses since wmf.22:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: Daimona Eaytoy)
[10:58:56] <wikibugs>	 (CR) Anzx: [C:+1] Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[10:59:06] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[10:59:13] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[10:59:21] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90639 and previous config saved to /var/cache/conftool/dbconfig/20260414-105920-fceratto.json
[10:59:24] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:00:05] <wikibugs>	 (PS17) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[11:00:52] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[11:02:08] <wikibugs>	 (CR) CI reject: [V:-1] P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[11:02:21] <wikibugs>	 (CR) STran: "recheck" [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: STran)
[11:04:28] <icinga-wm>	 PROBLEM - MariaDB Replica IO: ms3 on db2143 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1153.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1153.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:33] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90640 and previous config saved to /var/cache/conftool/dbconfig/20260414-110432-fceratto.json
[11:04:36] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:07:28] <icinga-wm>	 RECOVERY - MariaDB Replica IO: ms3 on db2143 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:12:48] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:13:42] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:14:41] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90641 and previous config saved to /var/cache/conftool/dbconfig/20260414-111440-fceratto.json
[11:15:02] <icinga-wm>	 PROBLEM - MariaDB Replica IO: ms3 on db1153 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2143.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2143.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:16:18] <wikibugs>	 (PS1) FNegri: mariadb: wiki-replicas: add missing grants [puppet] - https://gerrit.wikimedia.org/r/1270891 (https://phabricator.wikimedia.org/T422806)
[11:16:48] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1270892
[11:19:02] <icinga-wm>	 RECOVERY - MariaDB Replica IO: ms3 on db1153 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:21:10] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 553.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:22:39] <wikibugs>	 (PS2) FNegri: mariadb: wiki-replicas: add missing grants [puppet] - https://gerrit.wikimedia.org/r/1270891 (https://phabricator.wikimedia.org/T422806)
[11:24:05] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.pool pool db1153: Security updates
[11:24:05] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[11:24:18] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[11:24:18] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1153: Security updates
[11:24:49] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P90643 and previous config saved to /var/cache/conftool/dbconfig/20260414-112448-fceratto.json
[11:26:14] <logmsgbot>	 mvernon@cumin2002 convert-disks (PID 3035179) is awaiting input
[11:26:27] <wikibugs>	 (PS18) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[11:33:55] <wikibugs>	 (PS1) Matthias Mullie: Enable Extension:ReaderExperiments on itwiki, plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1270897 (https://phabricator.wikimedia.org/T423173)
[11:34:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T419635)', diff saved to https://phabricator.wikimedia.org/P90644 and previous config saved to /var/cache/conftool/dbconfig/20260414-113456-fceratto.json
[11:35:01] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:35:03] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[11:35:07] <wikibugs>	 (Abandoned) Matthias Mullie: Enable Extension:ReaderExperiments on itwiki, plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1270897 (https://phabricator.wikimedia.org/T423173) (owner: Matthias Mullie)
[11:35:11] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90645 and previous config saved to /var/cache/conftool/dbconfig/20260414-113510-fceratto.json
[11:37:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90646 and previous config saved to /var/cache/conftool/dbconfig/20260414-113721-fceratto.json
[11:37:37] <wikibugs>	 (PS1) MVernon: partman: also add ms-be206[8-9] to partman_early_command [puppet] - https://gerrit.wikimedia.org/r/1270899 (https://phabricator.wikimedia.org/T354872)
[11:39:35] <wikibugs>	 (PS1) Daniel Kinzler: redioscope: capture rate limit window duration [deployment-charts] - https://gerrit.wikimedia.org/r/1270901
[11:41:31] <wikibugs>	 (PS1) JMeybohm: kubernetes: Remove absent rsyslog config: block-docker-mount-spam [puppet] - https://gerrit.wikimedia.org/r/1270903 (https://phabricator.wikimedia.org/T395870)
[11:42:21] <wikibugs>	 (CR) Slyngshede: P:tofurkey Add tofurkey (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[11:45:59] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool esams [reason: network maintenance, T416450]
[11:46:03] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[11:46:22] <wikibugs>	 (CR) Klausman: [C:+1] "Added a few minor thoughts, nothing that really needs addressing right now." [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[11:46:33] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool esams [reason: network maintenance, T416450]
[11:47:01] <logmsgbot>	 !log ayounsi@cumin1003 conftool action : set/pooled=no; selector: cluster=dnsbox,dc=esams [reason: esams maintenance over]
[11:47:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90647 and previous config saved to /var/cache/conftool/dbconfig/20260414-114732-fceratto.json
[11:47:45] <Dreamy_Jazz>	 jouncebot: nowandnext
[11:47:46] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 12 minute(s)
[11:47:46] <jouncebot>	 In 0 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1200)
[11:48:24] <wikibugs>	 (PS1) STran: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270902 (https://phabricator.wikimedia.org/T423216)
[11:49:54] <wikibugs>	 (PS1) STran: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216)
[11:52:08] <wikibugs>	 (CR) Clément Goubert: [C:+1] kubernetes: Remove absent rsyslog config: block-docker-mount-spam [puppet] - https://gerrit.wikimedia.org/r/1270903 (https://phabricator.wikimedia.org/T395870) (owner: JMeybohm)
[11:53:05] <wikibugs>	 (CR) Muehlenhoff: [C:+2] proton: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1270448 (owner: Muehlenhoff)
[11:54:39] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply
[11:55:30] <wikibugs>	 (CR) Klausman: amg-gpu: Set up explicit GPU partitioning (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski)
[11:55:36] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply
[11:56:40] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "The glob matches the description." [puppet] - https://gerrit.wikimedia.org/r/1270899 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[11:57:25] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade
[11:57:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P90648 and previous config saved to /var/cache/conftool/dbconfig/20260414-115739-fceratto.json
[11:57:45] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[11:57:53] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T410589)', diff saved to https://phabricator.wikimedia.org/P90649 and previous config saved to /var/cache/conftool/dbconfig/20260414-115752-ladsgroup.json
[11:57:59] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[11:58:07] <wikibugs>	 (CR) Vgutierrez: [C:-1] "nice work, see inline comments" [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[11:58:26] <wikibugs>	 (CR) MVernon: [C:+2] partman: also add ms-be206[8-9] to partman_early_command [puppet] - https://gerrit.wikimedia.org/r/1270899 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1200)
[12:01:04] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[12:01:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11818956 (klausman) I think we can run this machine on a single disk until its replacement arrives. Even if it dies entirely, we have enough serving capacity in eqiad to handl...
[12:01:52] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[12:02:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T410589)', diff saved to https://phabricator.wikimedia.org/P90650 and previous config saved to /var/cache/conftool/dbconfig/20260414-120200-ladsgroup.json
[12:02:16] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[12:02:17] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1151: Security updates
[12:02:18] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[12:02:26] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[12:02:26] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1151: Security updates
[12:03:03] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply
[12:05:48] <wikibugs>	 (Abandoned) Tchanders: Add Special:GlobalContributions to no-IP reveal pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: Tchanders)
[12:05:59] <wikibugs>	 (CR) Klausman: [C:+1] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[12:06:18] <wikibugs>	 (PS19) Slyngshede: P:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[12:07:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T419635)', diff saved to https://phabricator.wikimedia.org/P90652 and previous config saved to /var/cache/conftool/dbconfig/20260414-120747-fceratto.json
[12:07:49] <wikibugs>	 (CR) Klausman: [C:+1] "LGTM, with one note: The API GW/Envoy does not do path normalization, hence the `(:|%3A|%3a)` regex. _If_ normaliztion is done in the new " [deployment-charts] - https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[12:07:52] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:08:05] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[12:08:13] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T419635)', diff saved to https://phabricator.wikimedia.org/P90653 and previous config saved to /var/cache/conftool/dbconfig/20260414-120812-fceratto.json
[12:08:58] <icinga-wm>	 PROBLEM - MariaDB Replica IO: ms2 on db2144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1151.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1151.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:11:58] <icinga-wm>	 RECOVERY - MariaDB Replica IO: ms2 on db2144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:12:56] <icinga-wm>	 PROBLEM - Host ganeti6004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:56] <icinga-wm>	 PROBLEM - Host durum6001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:56] <icinga-wm>	 PROBLEM - Host ganeti6002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:56] <icinga-wm>	 PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:08] <icinga-wm>	 RECOVERY - Host ganeti6002 is UP: PING OK - Packet loss = 0%, RTA = 87.51 ms
[12:13:24] <icinga-wm>	 RECOVERY - Host ganeti6004 is UP: PING OK - Packet loss = 0%, RTA = 89.09 ms
[12:13:24] <icinga-wm>	 RECOVERY - Host durum6001 is UP: PING OK - Packet loss = 0%, RTA = 89.36 ms
[12:13:24] <icinga-wm>	 RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 89.32 ms
[12:13:30] <wikibugs>	 (CR) Mszwarc: Update webonyx/graphql-php to 15.31.5 (3 comments) [vendor] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270902 (https://phabricator.wikimedia.org/T423216) (owner: STran)
[12:14:43] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[12:14:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:14:52] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:15:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:15:52] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:16:07] <wikibugs>	 (PS3) Arnaudb: gerrit: update sync-instances cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143)
[12:16:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:16:13] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:16:22] <volans>	 !ack
[12:16:22] <sirenbot>	 7836 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule@main)
[12:16:39] <jinxer-wm>	 FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:17:07] <volans>	 XioNoX: all expected? cr1-drmrs to cr2-eqiad seems unrelated
[12:17:08] <XioNoX>	 best time for a drmrs transport link failure...
[12:17:16] <volans>	 sigh
[12:17:21] <claime>	 Ugh
[12:17:37] <XioNoX>	 let's repool esams, I haven't rebooted any router yet for upgrade, just prepared things
[12:17:46] <volans>	 sounds good, thx
[12:17:52] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: network maintenance paused, T416450]
[12:17:53] <claime>	 Ack. Ping if you need help
[12:17:54] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool esams [reason: network maintenance paused, T416450]
[12:17:55] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[12:18:14] <wikibugs>	 (CR) JMeybohm: [C:-1] "As said on IRC: These probably need to go into a new file with:" [alerts] - https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: Blake)
[12:18:24] <volans>	 we've lost just redundancy on drmrs right? why the page?
[12:18:52] <volans>	 I see all back to 100% on the HAProxy graph, I guess just reconciliation time
[12:18:56] <icinga-wm>	 PROBLEM - MariaDB Replica IO: ms2 on db1151 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2144.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2144.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:19:15] <XioNoX>	 volans: yeah quick blip during the network convergence I guess
[12:19:26] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11819173 (MoritzMuehlenhoff)
[12:19:50] <XioNoX>	 looks like it's back up?
[12:19:52] <volans>	 do you want us to do anything for drmrs or are you taking care of it?
[12:20:19] <wikibugs>	 (CR) Mszwarc: [C:+1] Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) (owner: STran)
[12:20:21] <XioNoX>	 I'm having a look at the transport link, but I'll let you look at the HAproxy alert
[12:20:54] <volans>	 haproxy is back to normal AFAICS
[12:21:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:21:13] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:21:39] <jinxer-wm>	 RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:21:44] <XioNoX>	 yeah... just a blip
[12:21:53] <wikibugs>	 (CR) JMeybohm: [C:+2] kubernetes: Remove absent rsyslog config: block-docker-mount-spam [puppet] - https://gerrit.wikimedia.org/r/1270903 (https://phabricator.wikimedia.org/T395870) (owner: JMeybohm)
[12:21:54] <XioNoX>	 I guess I'll re-depool esams....
[12:21:58] <volans>	 with perfect timing
[12:22:06] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1270912
[12:22:12] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool esams [reason: network maintenance continue, T416450]
[12:22:14] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool esams [reason: network maintenance continue, T416450]
[12:22:55] <XioNoX>	 yeah, just annoying, we're < 8Gbps of transport traffic with esams+drmrs, so it's "just" annoying, no real user impact
[12:22:56] <icinga-wm>	 RECOVERY - MariaDB Replica IO: ms2 on db1151 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:24:09] <wikibugs>	 (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[12:24:32] <wikibugs>	 (CR) Dbrant: [C:+2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1270912 (owner: PipelineBot)
[12:26:04] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T419635)', diff saved to https://phabricator.wikimedia.org/P90654 and previous config saved to /var/cache/conftool/dbconfig/20260414-122603-fceratto.json
[12:26:06] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11819211 (MoritzMuehlenhoff)
[12:26:07] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:26:36] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1270912 (owner: PipelineBot)
[12:27:20] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:28:05] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.pool pool db1151: Security updates
[12:28:05] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[12:28:18] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[12:28:18] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1151: Security updates
[12:28:21] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:28:40] <wikibugs>	 (03PS1) 10Dreamy Jazz: [WIP] Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252)
[12:28:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:32:33] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:33:35] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[12:33:52] <XioNoX>	 !log cr1-esams - request chassis routing-engine master switch - T416450
[12:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:55] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[12:34:41] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:34:59] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[12:35:16] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:35:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) (owner: 10STran)
[12:35:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[12:36:07] <wikibugs>	 (03Abandoned) 10STran: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270902 (https://phabricator.wikimedia.org/T423216) (owner: 10STran)
[12:36:12] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P90656 and previous config saved to /var/cache/conftool/dbconfig/20260414-123611-fceratto.json
[12:36:32] <wikibugs>	 (03CR) 10JMeybohm: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[12:36:34] <wikibugs>	 (03Abandoned) 10STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: 10STran)
[12:37:51] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:37:54] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] rest-gateway: Add liftwing listeners and network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert)
[12:38:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:38:39] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:38:51] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[12:39:44] <XioNoX>	 interfaces are slowly coming back up
[12:41:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[12:41:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:41:52] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:42:51] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:42:59] <XioNoX>	 cr1-esams is backup, keeping an eye on drmrs
[12:43:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:43:14] <volans>	 is a backup or is back up? :-P
[12:43:39] <jinxer-wm>	 FIRING: [9x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:43:51] <jinxer-wm>	 RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[12:43:52] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:44:19] <XioNoX>	 volans: back up on the backup re :)   now pushing the upgrade on the primary RE, then will do another (and last) RE-switch
[12:44:29] <volans>	 :D
[12:44:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:46:16] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "Looks plausible" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert)
[12:46:20] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P90657 and previous config saved to /var/cache/conftool/dbconfig/20260414-124620-fceratto.json
[12:46:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:47:33] <wikibugs>	 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11819359 (10Ladsgroup) Okay, after four tries (!) we got envoy to work. Now...
[12:47:35] <XioNoX>	 while I think about it, as esams is running routed ganeti, can we move all the VMs from one rack to the other before I do the switch reboot to reduce the impact of that switch reboot?
[12:47:41] <XioNoX>	 moritzm: ^
[12:47:51] <jinxer-wm>	 RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:48:10] <jinxer-wm>	 RESOLVED: [6x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:48:39] <jinxer-wm>	 RESOLVED: [12x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:48:54] <wikibugs>	 (03PS1) 10Dreamy Jazz: [WIP] Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252)
[12:48:57] <Dreamy_Jazz>	 jouncebot: nowandnext
[12:48:58] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1200)
[12:48:58] <jouncebot>	 In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1300)
[12:49:19] <wikibugs>	 (03PS2) 10Dreamy Jazz: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252)
[12:49:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) (owner: 10Dreamy Jazz)
[12:49:42] <wikibugs>	 (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270447 (owner: 10PipelineBot)
[12:49:47] <wikibugs>	 (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270600 (owner: 10PipelineBot)
[12:49:52] <wikibugs>	 (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270892 (owner: 10PipelineBot)
[12:50:20] <moritzm>	 XioNoX: which virt nodes need to be emptied?
[12:50:56] <wikibugs>	 (03PS3) 10Dreamy Jazz: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252)
[12:51:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:52:12] <XioNoX>	 moritzm: 3006/3008 first (rack BW27) then 3005/3007 (rack BY27)
[12:52:19] <XioNoX>	 or the other way around, doesn't matter :)
[12:53:53] <moritzm>	 we can do that, but isn't esams going to be depooled anyway?
[12:54:19] <wikibugs>	 (03PS4) 10Dreamy Jazz: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252)
[12:54:48] <XioNoX>	 moritzm: the site is already depooled, so, like we did for drmrs, we can just downtime the hosts/vms and reboot the switches.
[12:55:01] <XioNoX>	 but I'm wondering if it's not cleaner to migrate them
[12:55:12] <XioNoX>	 no strong opinion here
[12:56:28] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T419635)', diff saved to https://phabricator.wikimedia.org/P90658 and previous config saved to /var/cache/conftool/dbconfig/20260414-125628-fceratto.json
[12:56:34] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[12:56:37] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:56:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T419635)', diff saved to https://phabricator.wikimedia.org/P90659 and previous config saved to /var/cache/conftool/dbconfig/20260414-125642-fceratto.json
[12:58:46] <icinga-wm>	 PROBLEM - Host re0.cr1-esams.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:59:17] <wikibugs>	 (03PS2) 10Elukey: admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993)
[12:59:32] <moritzm>	 XioNoX: the current cookbook only drains one node at a time, so w/o manual intervention we can't move all VMs to e.g. 3006/3008, after the first is drained, draining the second would also select nodes on the first node again
[12:59:39] <wikibugs>	 (03CR) 10Elukey: "Updated!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[12:59:46] <XioNoX>	 I see, ok
[12:59:48] <moritzm>	 so I'd say let's downtime and reboot then
[12:59:50] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade
[12:59:58] <XioNoX>	 sounds good
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1300)
[13:00:05] <jouncebot>	 Daimona, kipfel, Tran, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:08] <Dreamy_Jazz>	 \o
[13:00:12] <kipfel>	 o/
[13:00:13] <Tran>	 o/
[13:00:29] <Lucas_WMDE>	 o/ I could deploy in 10 minutes or so if needed
[13:00:38] <Dreamy_Jazz>	 (Depending on how long others take, I may need to do mine later as I need to go before the end of the window)
[13:01:06] <XioNoX>	 !log cr1-esams - request chassis routing-engine master switch - T416450
[13:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:09] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[13:01:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1270924
[13:01:45] <Dreamy_Jazz>	 Daimona: You around (you're first)?
[13:01:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:01:52] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:01:59] <wikibugs>	 (03PS1) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509)
[13:02:28] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1270924 (owner: 10Muehlenhoff)
[13:02:43] <wikibugs>	 (03PS1) 10Marostegui: db2214: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270926
[13:03:05] <wikibugs>	 (03PS1) 10Arnaudb: mailman: test httpd config before reloading [puppet] - 10https://gerrit.wikimedia.org/r/1270921 (https://phabricator.wikimedia.org/T323208)
[13:03:25] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:03:30] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2214.codfw.wmnet with reason: Reimage to Trixie
[13:03:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2214: Reimage to Trixie
[13:03:39] <jinxer-wm>	 FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:03:42] <Dreamy_Jazz>	 Looks like they are not here
[13:03:48] <icinga-wm>	 RECOVERY - Host re0.cr1-esams.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.69 ms
[13:03:50] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2214: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270926 (owner: 10Marostegui)
[13:03:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:03:54] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2214: Reimage to Trixie
[13:04:19] <Daimona>	 Dreamy_Jazz: yup, a bit late but around, sorry!
[13:04:40] <jinxer-wm>	 FIRING: [7x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:04:51] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:04:51] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:05:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1270924 (owner: 10Muehlenhoff)
[13:05:25] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[13:05:43] <Dreamy_Jazz>	 Daimona: You self deploying or need someone?
[13:05:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:05:52] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:06:00] <Dreamy_Jazz>	 (I forget who has server access)
[13:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:25] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2214.codfw.wmnet with OS trixie
[13:06:34] <Daimona>	 I'm not a deployer so I'd need help :)
[13:06:37] <wikibugs>	 (03PS1) 10Elukey: Update the systemd units to wait for udev before starting [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1270927
[13:06:47] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[13:06:48] <XioNoX>	 cr1-esams is back up
[13:07:02] <volans>	 yay
[13:07:15] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: 10Daimona Eaytoy)
[13:07:42] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: 10Stang)
[13:08:12] <wikibugs>	 (03Merged) 10jenkins-bot: Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: 10Daimona Eaytoy)
[13:08:25] <jinxer-wm>	 FIRING: [7x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:08:27] <Dreamy_Jazz>	 Tran: Do you mind if I do mine first before yours?
[13:08:30] <Dreamy_Jazz>	 I'll deploy the others
[13:08:33] <Tran>	 Sure
[13:08:34] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: 10Stang)
[13:08:39] <jinxer-wm>	 FIRING: [12x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:08:40] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[13:08:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) (owner: 10Dreamy Jazz)
[13:09:00] <wikibugs>	 (03PS1) 10Btullis: Prepare dse-k8s-ctrl servers for ipip migration [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437)
[13:09:09] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis)
[13:09:40] <jinxer-wm>	 RESOLVED: [7x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:09:51] <jinxer-wm>	 RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:09:51] <jinxer-wm>	 RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:09:54] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-esams,cr2-esams IPv6,cr2-esams.mgmt with reason: router upgrade
[13:09:58] <wikibugs>	 (03Merged) 10jenkins-bot: Enable VisualEditor hCaptcha on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270918 (https://phabricator.wikimedia.org/T423252) (owner: 10Dreamy Jazz)
[13:10:23] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2214: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270930
[13:10:26] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1270490|Stop setting $wgCampaignEventsEnableEventGoals (T414150)]], [[gerrit:1268293|Revert "zhwiki: Temporary Logo Change for WP25" (T414299)]], [[gerrit:1270918|Enable VisualEditor hCaptcha on testwiki (T423252)]]
[13:10:33] <stashbot>	 T414150: Drop feature flag for event goals - https://phabricator.wikimedia.org/T414150
[13:10:33] <stashbot>	 T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299
[13:10:34] <stashbot>	 T423252: Enable VisualEditor hCaptcha on testwiki - https://phabricator.wikimedia.org/T423252
[13:10:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282 (10MoritzMuehlenhoff) 03NEW
[13:10:51] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270873 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[13:11:23] <wikibugs>	 (03PS1) 10Ladsgroup: envoy: Add 1 retry for swift services [puppet] - 10https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872)
[13:12:08] <logmsgbot>	 !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS bullseye
[13:12:08] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2068
[13:12:20] <logmsgbot>	 !log dreamyjazz@deploy1003 daimona, stang, dreamyjazz: Backport for [[gerrit:1270490|Stop setting $wgCampaignEventsEnableEventGoals (T414150)]], [[gerrit:1268293|Revert "zhwiki: Temporary Logo Change for WP25" (T414299)]], [[gerrit:1270918|Enable VisualEditor hCaptcha on testwiki (T423252)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:12:22] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[13:12:36] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[13:12:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11819562 (10Volans) Some exception example from the puppetserver logs (cut out as they are pretty long):  ` 2026-04-14T12:57:10.834Z ERROR [qtp1196799668-113293]...
[13:12:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[13:12:43] * Lucas_WMDE is here now
[13:12:49] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11819563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye
[13:12:59] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[13:13:06] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[13:13:11] <kipfel>	 Dreamy_Jazz, the logo reverted as expected, LGTM
[13:13:11] <wikibugs>	 (03PS14) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877)
[13:13:23] <wikibugs>	 (03CR) 10Blake: "Acknowledged" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[13:13:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:27] <wikibugs>	 (03PS1) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1270932 (https://phabricator.wikimedia.org/T355446)
[13:13:27] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T419635)', diff saved to https://phabricator.wikimedia.org/P90661 and previous config saved to /var/cache/conftool/dbconfig/20260414-131326-fceratto.json
[13:13:30] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:13:32] <wikibugs>	 (03PS1) 10Robertsky: Update wikimaniawiki namespace search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278)
[13:13:39] <jinxer-wm>	 RESOLVED: [10x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:14:06] <wikibugs>	 (03Abandoned) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1270932 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede)
[13:14:10] <elukey>	 !log disable cert-renewal on wikikube staging clusters as a test for the PKI discovery intermediate rollout - To rollback, revert: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1270873 - T420993
[13:14:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:13] <stashbot>	 T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993
[13:14:37] <Dreamy_Jazz>	 Diamona: Any testing to be done?
[13:14:46] <Dreamy_Jazz>	 daimona:
[13:14:52] <Dreamy_Jazz>	 Daimona:
[13:15:06] <Daimona>	 I'll take a quick look
[13:15:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] mariadb: wiki-replicas: add missing grants [puppet] - 10https://gerrit.wikimedia.org/r/1270891 (https://phabricator.wikimedia.org/T422806) (owner: 10FNegri)
[13:15:14] <icinga-wm>	 RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Tue 12 May 2026 12:40:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[13:15:16] <Dreamy_Jazz>	 My testing is done
[13:15:34] <XioNoX>	 !log cr2-esams - request vmhost reboot - T416450
[13:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:37] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[13:15:52] <Daimona>	 Looks good, thank you
[13:15:57] <Dreamy_Jazz>	 Proceeding
[13:16:03] <logmsgbot>	 !log dreamyjazz@deploy1003 daimona, stang, dreamyjazz: Continuing with sync
[13:16:12] <wikibugs>	 (03PS20) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[13:17:26] <wikibugs>	 (03CR) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[13:19:51] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:19:51] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:19:53] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270490|Stop setting $wgCampaignEventsEnableEventGoals (T414150)]], [[gerrit:1268293|Revert "zhwiki: Temporary Logo Change for WP25" (T414299)]], [[gerrit:1270918|Enable VisualEditor hCaptcha on testwiki (T423252)]] (duration: 09m 27s)
[13:19:55] <wikibugs>	 (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[13:20:00] <stashbot>	 T414150: Drop feature flag for event goals - https://phabricator.wikimedia.org/T414150
[13:20:00] <stashbot>	 T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299
[13:20:00] <stashbot>	 T423252: Enable VisualEditor hCaptcha on testwiki - https://phabricator.wikimedia.org/T423252
[13:20:05] <Dreamy_Jazz>	 Tran: Over to you
[13:20:11] <Tran>	 Thanks
[13:21:49] <Tran>	 I can self deploy mine, but if anyone has any opinions on https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1270905 I would appreciate a third pair of eyes. It's a version bump/fix that is blocking another backport I want to do tomorrow so I'm not very familiar with it but it looked safe.
[13:21:51] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/50 (Core: cr2-esams:et-0/0/0 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:23:25] <jinxer-wm>	 FIRING: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:23:28] <Amir1>	 !log  on testcommonswiki drop table if exists categorylinks; drop table if exists externallinks; drop table if exists linktarget; drop table if exists collation; drop table if exists imagelinks; drop table if exists iwlinks; drop table if exists existencelinks; (T421914)
[13:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P90662 and previous config saved to /var/cache/conftool/dbconfig/20260414-132334-fceratto.json
[13:23:37] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[13:23:54] <jinxer-wm>	 FIRING: [16x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:24:16] * Lucas_WMDE is amazed to see testcommonswiki rise from the dead
[13:24:29] <Lucas_WMDE>	 Tran: I had a look at the change but I don’t understand the library well enough to really review it…
[13:25:18] <wikibugs>	 (CR) Bking: [C:+2] opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[13:26:33] <Amir1>	 :D
[13:26:37] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2214.codfw.wmnet with reason: host reimage
[13:26:41] <Lucas_WMDE>	 but I think I would say go ahead?
[13:26:50] <Tran>	 Lucas_WMDE: From what I can tell, it shouldn't affect any of our current usages and didn't force any follow ups. This will go out in the next branch cut anyway so I think it'll be ok?
[13:26:53] <Tran>	 k going
[13:27:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by stran@deploy1003 using scap backport" [vendor] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) (owner: STran)
[13:27:17] <wikibugs>	 (CR) Chlod Alejandro: [C:+1] Update wikimaniawiki namespace search [mediawiki-config] - https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: Robertsky)
[13:27:52] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:27:52] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:28:09] <XioNoX>	 cr2-esams is back up
[13:28:30] <jinxer-wm>	 FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-bw27-esams.mgmt.esams.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[13:29:09] <jinxer-wm>	 FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:30:40] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:30:45] <wikibugs>	 (PS2) Robertsky: Update wikimaniawiki namespace search [mediawiki-config] - https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278)
[13:31:03] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2214.codfw.wmnet with reason: host reimage
[13:31:13] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[13:31:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:31:51] <jinxer-wm>	 RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/50 (Core: cr2-esams:et-0/0/0 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:33:35] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: router upgrade
[13:33:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P90663 and previous config saved to /var/cache/conftool/dbconfig/20260414-133342-fceratto.json
[13:33:46] <wikibugs>	 (PS2) Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509)
[13:33:54] <jinxer-wm>	 RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:34:21] <XioNoX>	 I'm going to upgrade asw1-bw27, so that's going to bring hosts offline; hosts are depooled
[13:35:23] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[13:35:37] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-bw27-esams,asw1-bw27-esams IPv6,asw1-bw27-esams.mgmt with reason: router upgrade
[13:36:07] <XioNoX>	 !log asw1-bw27-esams> request system reboot - T416450
[13:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:11] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[13:36:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:36:49] <wikibugs>	 (PS3) Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509)
[13:37:50] <XioNoX>	 volans, claime, any idea what's up with this puppet failure in drmrs?
[13:37:59] <wikibugs>	 (Merged) jenkins-bot: Update webonyx/graphql-php to 15.31.5 [vendor] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270905 (https://phabricator.wikimedia.org/T423216) (owner: STran)
[13:38:21] <volans>	 XioNoX: if they were related to puppetserver1002 it's recovering as it's been depooled
[13:38:24] <volans>	 see -sre
[13:38:25] <logmsgbot>	 !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1270905|Update webonyx/graphql-php to 15.31.5 (T423216)]]
[13:38:28] <stashbot>	 T423216: Update webonyx/graphql-php to 15.31.5 - https://phabricator.wikimedia.org/T423216
[13:38:43] <wikibugs>	 (CR) Clément Goubert: "@mvernon@wikimedia.org was a little worried about retries when we are overwhelmed, I think 1 retry is ok" [puppet] - https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[13:39:14] <icinga-wm>	 PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:16] <icinga-wm>	 PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:18] <icinga-wm>	 PROBLEM - Host durum3005 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:18] <icinga-wm>	 PROBLEM - Host tcp-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:18] <icinga-wm>	 PROBLEM - Host tcp-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:18] <icinga-wm>	 PROBLEM - Host durum3006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:40] <claime>	 I imagine that's the switch reboot?
[13:39:57] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:00] <icinga-wm>	 PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 185.15.59.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:40:04] <claime>	 !ack
[13:40:05] <sirenbot>	 7837 (ACKED)  [7x] ProbeDown sre (probes/service esams)
[13:40:13] <logmsgbot>	 !log stran@deploy1003 stran: Backport for [[gerrit:1270905|Update webonyx/graphql-php to 15.31.5 (T423216)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:40:23] <volans>	 was not depooled?
[13:40:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:38] <volans>	 *downtimed
[13:40:42] <Tran>	 testing now
[13:40:52] <claime>	 volans: Do we have "trickling" downtimes ?
[13:41:08] <claime>	 (i.e. silencing the switch silences host down for hosts plugged into it?)
[13:41:21] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:41:23] <cdanis>	 claime: lol
[13:41:30] <jinxer-wm>	 FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs3009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[13:41:30] <XioNoX>	 claime: in icinga yes, but not in alert-manager
[13:41:35] <volans>	 ^^^
[13:41:35] <claime>	 cdanis: I figured
[13:41:43] <Tran>	 nothing looks broken, continuing backport
[13:41:45] <jinxer-wm>	 RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:41:47] <logmsgbot>	 !log stran@deploy1003 stran: Continuing with sync
[13:41:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11819656 (VRiley-WMF) a:VRiley-WMF→Jgreen
[13:41:52] <volans>	 when the old system is better than the new one :D
[13:41:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:42:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:42:00] <XioNoX>	 claime: I did downtime the hosts with `P{P:netbox::host%location ~ "BW.*esams"}` but of course it didn't downtime the VMs...
[13:42:19] <claime>	 XioNoX: ah :D
[13:42:24] <volans>	 nor the global services not attached to a host
[13:42:32] <XioNoX>	 yeah...
[13:42:48] <volans>	 the vms are easy to add ;)
[13:43:02] <XioNoX>	 volans: are they? :) we need to query which hosts they are on
[13:43:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T419635)', diff saved to https://phabricator.wikimedia.org/P90664 and previous config saved to /var/cache/conftool/dbconfig/20260414-134350-fceratto.json
[13:43:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:43:55] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:44:09] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[13:44:11] <volans>	 we have cluster and group
[13:44:16] <jinxer-wm>	 FIRING: [20x] JobUnavailable: Reduced availability for job benthos in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:44:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T419635)', diff saved to https://phabricator.wikimedia.org/P90665 and previous config saved to /var/cache/conftool/dbconfig/20260414-134416-fceratto.json
[13:44:57] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:45:30] <logmsgbot>	 !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270905|Update webonyx/graphql-php to 15.31.5 (T423216)]] (duration: 07m 05s)
[13:45:34] <stashbot>	 T423216: Update webonyx/graphql-php to 15.31.5 - https://phabricator.wikimedia.org/T423216
[13:45:50] <wikibugs>	 (CR) STran: "recheck" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: STran)
[13:45:55] <jinxer-wm>	 FIRING: [5x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:46:09] <jinxer-wm>	 FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:46:11] <Tran>	 done and I think that's it for backports this window
[13:46:30] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs3009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[13:46:50] <XioNoX>	 not sure how I can downtime that paging alert ahead of time?
[13:47:04] <volans>	 that's the tricky one
[13:47:14] <volans>	 you can downtime it but that will mask other issues
[13:47:45] <Lucas_WMDE>	 Tran: thanks! (and thanks Dreamy_Jazz too!)
[13:47:50] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:47:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:41] <XioNoX>	 switch is back up on console, waiting for the ports to show up
[13:49:02] <icinga-wm>	 RECOVERY - Router interfaces on mr1-esams is OK: OK: host 185.15.59.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:49:16] <jinxer-wm>	 FIRING: [20x] JobUnavailable: Reduced availability for job benthos in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:49:22] <XioNoX>	 here are the recoveries
[13:49:54] <icinga-wm>	 RECOVERY - Host durum3005 is UP: PING OK - Packet loss = 0%, RTA = 80.61 ms
[13:49:56] <icinga-wm>	 RECOVERY - Host tcp-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.77 ms
[13:49:56] <icinga-wm>	 RECOVERY - Host durum3006 is UP: PING OK - Packet loss = 0%, RTA = 80.59 ms
[13:49:57] <icinga-wm>	 RECOVERY - Host tcp-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.62 ms
[13:49:57] <jinxer-wm>	 RESOLVED: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:49:58] <icinga-wm>	 RECOVERY - Host ncredir3005 is UP: PING OK - Packet loss = 0%, RTA = 80.70 ms
[13:50:16] <icinga-wm>	 RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.71 ms
[13:50:27] <jinxer-wm>	 FIRING: [9x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:50:28] <XioNoX>	 volans: and how to downtime the VMs in a given rack?
[13:50:29] <volans>	 XioNoX: with routed ganeti it is indeed harder, the cluster is just one for all vms
[13:50:41] <XioNoX>	 yeah
[13:50:47] <volans>	 ganeti_cluster: esams03, ganeti_group: B (from hiera)
[13:51:21] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:51:30] <jinxer-wm>	 RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs3009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[13:52:00] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:52:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935536 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:52:48] <XioNoX>	 volans: and the only reason it's paging is because the depool threshold is more than a rack, so it sends the LVS healthchecks to the offline realservers...
[13:53:07] <volans>	 ah
[13:53:28] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2214.codfw.wmnet with OS trixie
[13:54:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2214: after reimage to trixie
[13:54:16] <jinxer-wm>	 RESOLVED: [20x] JobUnavailable: Reduced availability for job benthos in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:54:25] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: router upgrade
[13:54:36] <XioNoX>	 I'm going to reboot rack BW27 shortly
[13:54:57] <volans>	 XioNoX: cumin 'F:lldp.parent ~ "ganeti300[6,8]\..*"'
[13:55:04] <volans>	 for the downtime of the vms of BW
[13:55:08] <logmsgbot>	 !log mszwarc@deploy1003 mwscript-k8s job started: foreachwikiindblist all backfillInterwikiRightsLog.php --remote-wiki metawiki 20260311190000  # T6055 (third attempt)
[13:55:11] <stashbot>	 T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055
[13:55:13] <volans>	 you could do the same for the other one with the other ganeti
[13:55:22] <XioNoX>	 volans: nice! thanks
[13:55:27] <jinxer-wm>	 RESOLVED: [9x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:55:43] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-by27-esams,asw1-by27-esams IPv6,asw1-by27-esams.mgmt with reason: router upgrade
[13:56:17] <wikibugs>	 (PS4) Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509)
[13:56:30] <XioNoX>	 volans: Scheduling downtime on Icinga server alert1002.wikimedia.org for hosts: bast3007,doh3005,hcaptcha-proxy[3001-3002],install3004,ncredir3006,netflow3004,prometheus3004
[13:56:31] <XioNoX>	 nice
[13:56:41] <volans>	 :D
[13:56:45] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: router upgrade
[13:56:52] <XioNoX>	 I need to document it somewhere :)
[13:57:26] <XioNoX>	 !log asw1-by27-esams> request system reboot - T416450
[13:57:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:29] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[13:57:44] <jinxer-wm>	 RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935536 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:57:45] <volans>	 so we should expect a page too right?
[13:58:06] <XioNoX>	 volans: yeah, unless we're lucky and the alertmanager probes only arrive on a working backend :)
[13:58:10] <volans>	 :D
[14:00:05] <jouncebot>	 Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1400)
[14:01:03] <icinga-wm>	 PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 185.15.59.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:02:00] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:02:12] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T419635)', diff saved to https://phabricator.wikimedia.org/P90667 and previous config saved to /var/cache/conftool/dbconfig/20260414-140211-fceratto.json
[14:02:16] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:04:22] <wikibugs>	 (PS1) Arnaudb: gerrit: access logging with Envoy [puppet] - https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827)
[14:04:24] <wikibugs>	 (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: Arnaudb)
[14:07:43] <wikibugs>	 (PS1) Muehlenhoff: Make cn=growthbook-admin managed in Bitu [puppet] - https://gerrit.wikimedia.org/r/1270952 (https://phabricator.wikimedia.org/T420688)
[14:10:30] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2214: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270930 (owner: Marostegui)
[14:11:03] <icinga-wm>	 RECOVERY - Router interfaces on mr1-esams is OK: OK: host 185.15.59.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:11:10] <XioNoX>	 volans: no page :)
[14:11:28] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:11:28] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:11:33] <volans>	 too soon?
[14:11:35] <volans>	 !ack
[14:11:36] <sirenbot>	 7838 (ACKED)  [4x] ProbeDown sre (ip4 probes/service esams)
[14:11:38] <vgutierrez>	 jinxed it :)
[14:11:48] <XioNoX>	 weird, it started paging when the switch came back up :)
[14:11:49] <XioNoX>	 hahaha
[14:11:52] <XioNoX>	 yeah
[14:12:03] <volans>	 XioNoX: I also noticed some imbalance of VMs distribution between the racks, mentioned that to suk.he
[14:12:15] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:12:20] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P90669 and previous config saved to /var/cache/conftool/dbconfig/20260414-141219-fceratto.json
[14:12:45] <XioNoX>	 volans: can you link it to https://phabricator.wikimedia.org/T395883 ?
[14:13:24] <volans>	 sure
[14:13:38] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.remove-downtime for 8 hosts
[14:13:43] <XioNoX>	 removing the downtimes
[14:13:44] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 8 hosts
[14:13:48] <XioNoX>	 but we're all good network wise
[14:14:03] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.remove-downtime for 12 hosts
[14:14:10] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 12 hosts
[14:14:17] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.remove-downtime for 13 hosts
[14:14:25] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 13 hosts
[14:15:25] <wikibugs>	 (PS1) Cwhite: opensearch: correct o11y usage in comment [puppet] - https://gerrit.wikimedia.org/r/1270953 (https://phabricator.wikimedia.org/T422860)
[14:16:02] <XioNoX>	 repooling esams
[14:16:02] <volans>	 XioNoX: commented there
[14:16:21] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: network maintenance finished, T416450]
[14:16:24] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool esams [reason: network maintenance finished, T416450]
[14:16:25] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[14:16:28] <jinxer-wm>	 RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:16:28] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:16:39] <logmsgbot>	 !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS bullseye
[14:16:44] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11819787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye execu...
[14:16:58] <logmsgbot>	 !log ayounsi@cumin1003 conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=esams [reason: esams maintenance over]
[14:17:14] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:18:20] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286 (MatthewVernon) NEW
[14:18:32] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:20:01] <cdanis>	 jouncebot: nowandnext
[14:20:01] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1400)
[14:20:01] <jouncebot>	 In 0 hour(s) and 9 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1430)
[14:20:45] <wikibugs>	 (CR) Brouberol: [C:+1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1270952 (https://phabricator.wikimedia.org/T420688) (owner: Muehlenhoff)
[14:22:28] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P90670 and previous config saved to /var/cache/conftool/dbconfig/20260414-142227-fceratto.json
[14:22:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11819840 (ayounsi) Open→Resolved a:ayounsi Upgraded to 23.4R2-S8 and all is well.
[14:22:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:23:55] <wikibugs>	 (Abandoned) Ayounsi: Temporarily geodns GB and IE to eqiad [dns] - https://gerrit.wikimedia.org/r/1268538 (https://phabricator.wikimedia.org/T416450) (owner: Ayounsi)
[14:24:46] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:25:19] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:25:33] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:25:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:26:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:26:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:28:43] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make cn=growthbook-admin managed in Bitu [puppet] - https://gerrit.wikimedia.org/r/1270952 (https://phabricator.wikimedia.org/T420688) (owner: Muehlenhoff)
[14:29:44] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11819873 (MatthewVernon) At the suggestion of @elukey on IRC, I am trying a firmware downgrade to 6.10.30.20 (the version 5.0.20.0 that this system started wit...
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1430)
[14:30:19] <wikibugs>	 (PS1) Andrew Bogott: Remove disk-based ssh keys for me, Andrew Bogott [puppet] - https://gerrit.wikimedia.org/r/1270956
[14:30:28] <wikibugs>	 (CR) Clément Goubert: "That's strange, as far as I can tell, ATS does path normalization for `%3A` in [0], and it is in the plugin chain for `/service/` [1] so w" [deployment-charts] - https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[14:30:53] <wikibugs>	 (PS1) CDanis: SwiftFileBackend: propagate tracing context to HTTP client [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270957
[14:32:36] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T419635)', diff saved to https://phabricator.wikimedia.org/P90672 and previous config saved to /var/cache/conftool/dbconfig/20260414-143235-fceratto.json
[14:32:40] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:32:53] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[14:33:01] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T419635)', diff saved to https://phabricator.wikimedia.org/P90673 and previous config saved to /var/cache/conftool/dbconfig/20260414-143301-fceratto.json
[14:34:10] <cdanis>	 anyone from Test Kitchen using this deploy window?
[14:34:11] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1270956 (owner: Andrew Bogott)
[14:34:36] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Remove disk-based ssh keys for me, Andrew Bogott [puppet] - https://gerrit.wikimedia.org/r/1270956 (owner: Andrew Bogott)
[14:35:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet
[14:35:07] <wikibugs>	 (PS1) Marostegui: data.yaml: Remove my old key [puppet] - https://gerrit.wikimedia.org/r/1270958
[14:35:58] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270959 (https://phabricator.wikimedia.org/T422509)
[14:36:02] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[14:36:14] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11819920 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye
[14:36:32] <wikibugs>	 (CR) CI reject: [V:-1] cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270959 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[14:37:15] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:38:35] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1270958 (owner: Marostegui)
[14:39:12] <wikibugs>	 (CR) Marostegui: [C:+2] data.yaml: Remove my old key [puppet] - https://gerrit.wikimedia.org/r/1270958 (owner: Marostegui)
[14:39:31] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2214: after reimage to trixie
[14:40:11] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270960 (https://phabricator.wikimedia.org/T422509)
[14:40:46] <wikibugs>	 (CR) CI reject: [V:-1] cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270960 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[14:43:15] <wikibugs>	 (Abandoned) Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270959 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[14:43:21] <wikibugs>	 (Abandoned) Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270960 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[14:44:38] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:44:46] <wikibugs>	 SRE, SRE-swift-storage, Ceph, ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11820039 (Scott_French)
[14:48:31] <wikibugs>	 SRE, SRE-swift-storage, Ceph, ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11820087 (Scott_French) @MLechvien-WMF - I've updated the task description to capture the discussion here...
[14:49:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:49:55] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:49:56] <wikibugs>	 (PS1) Papaul: Add my FIDO backup key [puppet] - https://gerrit.wikimedia.org/r/1270962 (https://phabricator.wikimedia.org/T423293)
[14:49:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:50:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:50:46] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:51:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T419635)', diff saved to https://phabricator.wikimedia.org/P90676 and previous config saved to /var/cache/conftool/dbconfig/20260414-145107-fceratto.json
[14:51:11] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:51:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:51:45] <wikibugs>	 (PS3) Blake: service: exclude apus from the switchover. [puppet] - https://gerrit.wikimedia.org/r/1269382 (https://phabricator.wikimedia.org/T422166)
[14:51:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:53:28] <wikibugs>	 (CR) Elukey: istio: revisit Prometheus buckets for Wikikube (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[14:53:53] <wikibugs>	 (PS1) Ayounsi: move-vlan cookbook: add "inplace" support [cookbooks] - https://gerrit.wikimedia.org/r/1270965
[14:54:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[14:54:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:54:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:56:04] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:56:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:56:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:56:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:59:02] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Blake!" [puppet] - https://gerrit.wikimedia.org/r/1269382 (https://phabricator.wikimedia.org/T422166) (owner: Blake)
[14:59:28] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[14:59:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11820175 (WMDE-leszek) Resolved→Open hello @MatthewVernon, hello all. I don't know if you'll able to troubleshoot it somehow but it seems that `nicho...
[14:59:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:00:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:00:05] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1500). Please do the needful.
[15:01:16] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P90677 and previous config saved to /var/cache/conftool/dbconfig/20260414-150115-fceratto.json
[15:01:32] <wikibugs>	 (PS1) Federico Ceratto: sre.mysql.depool: Do not require tmux/screen [cookbooks] - https://gerrit.wikimedia.org/r/1270963
[15:02:20] <wikibugs>	 (PS1) Daniel Kinzler: rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - https://gerrit.wikimedia.org/r/1270968
[15:03:57] <wikibugs>	 (CR) Clément Goubert: [C:+1] rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - https://gerrit.wikimedia.org/r/1270968 (owner: Daniel Kinzler)
[15:04:13] <wikibugs>	 (CR) Clément Goubert: [C:+1] redioscope: capture rate limit window duration [deployment-charts] - https://gerrit.wikimedia.org/r/1270901 (owner: Daniel Kinzler)
[15:05:03] <wikibugs>	 (CR) Jasmine: "Thanks, done!" [dns] - https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) (owner: Jasmine)
[15:05:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:06:00] <claime>	 jouncebot: nowandnext
[15:06:00] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1500)
[15:06:00] <jouncebot>	 In 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1600)
[15:06:07] <wikibugs>	 (PS4) Blake: service: exclude apus from the switchover. [puppet] - https://gerrit.wikimedia.org/r/1269382
[15:06:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:07:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:07:21] <wikibugs>	 (PS5) Blake: service: exclude apus from the switchover. [puppet] - https://gerrit.wikimedia.org/r/1269382
[15:07:34] <wikibugs>	 (CR) Blake: service: exclude apus from the switchover. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1269382 (owner: Blake)
[15:08:49] <wikibugs>	 (CR) Clément Goubert: [C:+1] service: exclude apus from the switchover. [puppet] - https://gerrit.wikimedia.org/r/1269382 (owner: Blake)
[15:09:06] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] redioscope: capture rate limit window duration [deployment-charts] - https://gerrit.wikimedia.org/r/1270901 (owner: Daniel Kinzler)
[15:09:20] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - https://gerrit.wikimedia.org/r/1270968 (owner: Daniel Kinzler)
[15:10:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:10:04] <wikibugs>	 (PS1) Vgutierrez: admin: Remove legacy ssh key for vgutierrez [puppet] - https://gerrit.wikimedia.org/r/1270969
[15:10:51] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1270969 (owner: Vgutierrez)
[15:10:56] <wikibugs>	 (PS1) CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270970
[15:11:07] <wikibugs>	 (Merged) jenkins-bot: redioscope: capture rate limit window duration [deployment-charts] - https://gerrit.wikimedia.org/r/1270901 (owner: Daniel Kinzler)
[15:11:09] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270970 (owner: CDanis)
[15:11:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P90678 and previous config saved to /var/cache/conftool/dbconfig/20260414-151123-fceratto.json
[15:11:30] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: anon-browser -> 200 (shadow) [deployment-charts] - https://gerrit.wikimedia.org/r/1270968 (owner: Daniel Kinzler)
[15:12:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:12:19] <wikibugs>	 (CR) Scott French: [C:+1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1269382 (owner: Blake)
[15:12:34] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:12:43] <wikibugs>	 (CR) Vgutierrez: [C:+2] admin: Remove legacy ssh key for vgutierrez [puppet] - https://gerrit.wikimedia.org/r/1270969 (owner: Vgutierrez)
[15:12:59] <wikibugs>	 (CR) CI reject: [V:-1] puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270970 (owner: CDanis)
[15:13:08] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:13:16] <wikibugs>	 (PS1) CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:13:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:15:21] <wikibugs>	 (CR) CI reject: [V:-1] puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:15:50] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply
[15:16:00] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11820271 (MatthewVernon) As before, post-installer boot was fine, but after puppet it gets as far as: ` Booting from Hard drive C: GRUB `  and hangs.
[15:16:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:16:40] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:17:26] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[15:18:10] <wikibugs>	 (Abandoned) CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270970 (owner: CDanis)
[15:18:12] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[15:18:41] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[15:20:04] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:20:06] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:20:55] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:20:58] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270975 (https://phabricator.wikimedia.org/T422509)
[15:21:01] <wikibugs>	 (PS2) CDanis: puppetserver: install cidergrinder, run nightly grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:21:06] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:21:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T419635)', diff saved to https://phabricator.wikimedia.org/P90679 and previous config saved to /var/cache/conftool/dbconfig/20260414-152132-fceratto.json
[15:21:36] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:21:48] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[15:21:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T419635)', diff saved to https://phabricator.wikimedia.org/P90680 and previous config saved to /var/cache/conftool/dbconfig/20260414-152156-fceratto.json
[15:21:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301 (Passimacopoulos) NEW
[15:22:20] <wikibugs>	 (CR) Jasmine: [C:+2] Add Kubernetes POD IP reverse range delegations for wikikube-ctrl200[4-5] [dns] - https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) (owner: Jasmine)
[15:22:40] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:23:02] <logmsgbot>	 !log jasmine@dns1004 START - running authdns-update
[15:23:25] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11820314 (MatthewVernon) @WMDE-leszek analytics_privatedata_users isn't an LDAP group, it's a shell group, so it wouldn't appear in the ldap listing (for ins...
[15:23:54] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:23:58] <wikibugs>	 (PS1) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[15:24:26] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[15:24:29] <wikibugs>	 (PS1) CDanis: hiddenparma: add atsuko [labs/private] - https://gerrit.wikimedia.org/r/1270977
[15:24:29] <logmsgbot>	 !log jasmine@dns1004 END - running authdns-update
[15:24:46] <wikibugs>	 (CR) CDanis: [C:+2] hiddenparma: add atsuko [labs/private] - https://gerrit.wikimedia.org/r/1270977 (owner: CDanis)
[15:25:05] <wikibugs>	 (CR) CDanis: [V:+2 C:+2] hiddenparma: add atsuko [labs/private] - https://gerrit.wikimedia.org/r/1270977 (owner: CDanis)
[15:25:24] <wikibugs>	 (PS2) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[15:25:49] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:26:38] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[15:27:04] <wikibugs>	 (CR) CI reject: [V:-1] redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[15:29:08] <wikibugs>	 (PS3) CDanis: puppetserver: install cidergrinder, run daily grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:29:16] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:30:22] <wikibugs>	 (CR) Klausman: [C:+1] Update the systemd units to wait for udev before starting [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1270927 (owner: Elukey)
[15:30:32] <wikibugs>	 (PS1) Volans: sre.ganeti: add new pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:30:59] <cdanis>	 jouncebot: nowandnext
[15:30:59] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1500)
[15:30:59] <jouncebot>	 In 0 hour(s) and 29 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1600)
[15:31:46] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11820350 (WMDE-leszek) Open→Resolved ah, thanks @MatthewVernon, I forgot about this detail. Shell groups seem ok then, so I'll close this ticket...
[15:32:19] <wikibugs>	 (CR) CDanis: [C:+2] SwiftFileBackend: propagate tracing context to HTTP client [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270957 (owner: CDanis)
[15:32:49] <logmsgbot>	 !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS bullseye
[15:32:55] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11820361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye execu...
[15:33:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2068.codfw.wmnet
[15:33:07] <wikibugs>	 (PS2) Volans: sre.ganeti: add new pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:33:28] <wikibugs>	 (PS3) Volans: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:34:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cdanis@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270957 (owner: CDanis)
[15:35:08] <wikibugs>	 (CR) Klausman: [C:+1] istio: revisit Prometheus buckets for Wikikube (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[15:36:10] <logmsgbot>	 mvernon@cumin2002 upgrade-firmware (PID 3326764) is awaiting input
[15:36:45] <wikibugs>	 (Merged) jenkins-bot: SwiftFileBackend: propagate tracing context to HTTP client [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1270957 (owner: CDanis)
[15:37:11] <logmsgbot>	 !log cdanis@deploy1003 Started scap sync-world: Backport for [[gerrit:1270957|SwiftFileBackend: propagate tracing context to HTTP client]]
[15:38:59] <logmsgbot>	 !log cdanis@deploy1003 cdanis: Backport for [[gerrit:1270957|SwiftFileBackend: propagate tracing context to HTTP client]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:39:17] <cdanis>	 hey Amir1 you around? 👀
[15:39:44] <Amir1>	 cdanis: meeting
[15:39:50] <wikibugs>	 (CR) Clément Goubert: "`" [deployment-charts] - https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[15:39:51] <wikibugs>	 (PS1) Scott French: Set initialDelaySeconds on aqs-http-gateway direct Cassandra clients [deployment-charts] - https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168)
[15:40:40] <wikibugs>	 (CR) Scott French: "Just preparing this in case we decide to go this route." [deployment-charts] - https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: Scott French)
[15:40:46] <cdanis>	 lol I don't have access to upload to testwiki?
[15:41:41] <logmsgbot>	 !log cdanis@deploy1003 cdanis: Continuing with sync
[15:43:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T419635)', diff saved to https://phabricator.wikimedia.org/P90681 and previous config saved to /var/cache/conftool/dbconfig/20260414-154302-fceratto.json
[15:43:06] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:43:20] <wikibugs>	 SRE, SRE-swift-storage, Ceph, ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11820398 (Blake) I'll merge the exclusion patch and work on updating the docs tomorrow.  I'm inclined to...
[15:45:35] <logmsgbot>	 !log cdanis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270957|SwiftFileBackend: propagate tracing context to HTTP client]] (duration: 08m 24s)
[15:46:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11820425 (BCornwall) We've decided to move forward with this task. Would dcops be willing to handle the NIC revert in lvs1017?
[15:47:35] <wikibugs>	 (PS3) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[15:50:16] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2068.codfw.wmnet
[15:52:10] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[15:52:17] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11820469 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye
[15:53:10] <wikibugs>	 (PS1) Robertsky: Update wikimania wordmark for 2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331)
[15:53:11] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P90682 and previous config saved to /var/cache/conftool/dbconfig/20260414-155310-fceratto.json
[15:54:07] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11820483 (MatthewVernon) Tried a BIOS upgrade from 2.12.2 to 2.24.0. That didn't cause the system to become bootable, but trying yet another reimage.  Plan is...
[15:54:11] <wikibugs>	 (PS2) Ayounsi: move-vlan cookbook: add "inplace" support [cookbooks] - https://gerrit.wikimedia.org/r/1270965
[15:56:07] <wikibugs>	 (PS1) Jforrester: wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks [mediawiki-config] - https://gerrit.wikimedia.org/r/1270987
[15:56:07] <wikibugs>	 (PS1) Jforrester: wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1270988
[15:56:36] <wikibugs>	 (PS5) Pppery: Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054)
[15:56:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: Robertsky)
[15:57:03] <wikibugs>	 (PS4) Volans: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[15:57:31] <wikibugs>	 (CR) Pppery: Drop 1.5x logos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: Pppery)
[15:57:42] <James_F>	 I'm going to deploy ^ to debug the Abstract Wiki caching issue unless someone shouts.
[15:59:39] <wikibugs>	 (CR) CI reject: [V:-1] sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[15:59:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270987 (owner: Jforrester)
[16:00:05] <jouncebot>	 jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:22] <wikibugs>	 (Merged) jenkins-bot: wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks [mediawiki-config] - https://gerrit.wikimedia.org/r/1270987 (owner: Jforrester)
[16:01:47] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1270987|wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks]]
[16:01:52] <wikibugs>	 (PS5) Volans: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883)
[16:03:19] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P90683 and previous config saved to /var/cache/conftool/dbconfig/20260414-160319-fceratto.json
[16:03:40] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1270987|wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:03:43] <wikibugs>	 SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307 (herron) NEW p:Triage→Medium
[16:04:34] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[16:04:46] <wikibugs>	 (CR) Ayounsi: [C:+1] "nice!" [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[16:06:01] <wikibugs>	 (CR) Clément Goubert: redioscope: move survey to service defintion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:07:23] <wikibugs>	 (CR) Volans: [C:+2] sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[16:08:12] <wikibugs>	 (PS4) Daniel Kinzler: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976
[16:08:12] <wikibugs>	 (PS1) Sbisson: Register ArticleGuidance extension and enable in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295)
[16:08:20] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270987|wmgMonologChannels: Add WikiLambda* sub-channels, all at debug for some quick checks]] (duration: 06m 32s)
[16:09:16] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270988 (owner: Jforrester)
[16:10:22] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[16:11:02] <wikibugs>	 SRE-SLO, Patch-For-Review: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11820559 (herron)
[16:11:27] <wikibugs>	 (Merged) jenkins-bot: sre.ganeti: add check-pop-vm-redundancy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1270978 (https://phabricator.wikimedia.org/T395883) (owner: Volans)
[16:11:43] <wikibugs>	 (Merged) jenkins-bot: wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1270988 (owner: Jforrester)
[16:12:04] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1270988|wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info]]
[16:12:11] <wikibugs>	 (CR) Clément Goubert: [C:+1] redioscope: move survey to service defintion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:13:27] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T419635)', diff saved to https://phabricator.wikimedia.org/P90684 and previous config saved to /var/cache/conftool/dbconfig/20260414-161326-fceratto.json
[16:13:31] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:13:44] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[16:13:52] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90685 and previous config saved to /var/cache/conftool/dbconfig/20260414-161351-fceratto.json
[16:13:52] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1270988|wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:15:30] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[16:15:39] <wikibugs>	 (CR) Tjones: [C:+1] "I only have +1 in this repo, but this looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[16:16:20] <wikibugs>	 (CR) Cwhite: [C:+2] opensearch: correct o11y usage in comment [puppet] - https://gerrit.wikimedia.org/r/1270953 (https://phabricator.wikimedia.org/T422860) (owner: Cwhite)
[16:16:45] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[16:16:49] <wikibugs>	 (CR) Cwhite: [C:+2] smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: Cwhite)
[16:17:43] <wikibugs>	 (PS1) Herron: puppet: remove pyrra modules/profiles [puppet] - https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307)
[16:17:46] <wikibugs>	 (PS1) Herron: pyrra: remove pyrra/slo/slos dns entries [dns] - https://gerrit.wikimedia.org/r/1270995 (https://phabricator.wikimedia.org/T423307)
[16:17:48] <wikibugs>	 (PS2) Herron: pyrra: remove configuration for web interface [puppet] - https://gerrit.wikimedia.org/r/1270992 (https://phabricator.wikimedia.org/T423307)
[16:17:50] <wikibugs>	 (PS4) Herron: pyrra: ensure absent on package and services [puppet] - https://gerrit.wikimedia.org/r/1270974 (https://phabricator.wikimedia.org/T423307)
[16:18:38] <wikibugs>	 (CR) Daniel Kinzler: redioscope: move survey to service defintion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:19:08] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:19:16] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:20:15] <wikibugs>	 (CR) Tjones: [C:+1] "+1 is all I got! Seems reasonable" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[16:20:32] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270988|wmgMonologChannels: Reduce WikiLambda* sub-channels logging from debug to info]] (duration: 08m 27s)
[16:20:38] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11820651 (Aklapper) Hi @Passimacopoulos, welcome to Wikimedia Phabricator! Please also connect your [MediaWiki/SUL account](https://meta.wikimedia.org/wiki/Specia...
[16:21:06] <wikibugs>	 (CR) Tjones: [C:+1] search: add alt. completion indices to test keyword tokenizer (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[16:21:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:21:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, Wikidata Platform Team: Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312 (RobH) NEW
[16:21:30] <wikibugs>	 (Merged) jenkins-bot: redioscope: move survey to service defintion [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:21:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:21:49] <wikibugs>	 ops-codfw, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11820687 (RobH)
[16:21:59] <wikibugs>	 ops-codfw, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11820691 (RobH)
[16:22:37] <wikibugs>	 ops-codfw, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11820694 (RobH) a:bking Please update the site.pp file with the insetup role for your team (detailed on https://wiki...
[16:23:35] <wikibugs>	 (CR) Scott French: "@eevans@wikimedia.org - Depending on whether there's a straightforward solution to improve gocql client behavior, it probably makes sense " [deployment-charts] - https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: Scott French)
[16:23:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314 (RobH) NEW
[16:24:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11820718 (RobH) a:bking Please update the site.pp file with the insetup role for your team (detailed on https://wikit...
[16:24:21] <wikibugs>	 (CR) ArielGlenn: "I'm missing some cotext I think; could you say something about the time_bucket change? Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1270976 (owner: Daniel Kinzler)
[16:24:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11820727 (RobH)
[16:28:26] <wikibugs>	 SRE-Access-Requests, Patch-For-Review: Add Papaul FIDO backup SSH key - https://phabricator.wikimedia.org/T423293#11820743 (Aklapper) + #SRE-Access-Requests per https://wikitech.wikimedia.org/wiki/Yubikey-SSH-FIDO
[16:29:46] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90686 and previous config saved to /var/cache/conftool/dbconfig/20260414-162945-fceratto.json
[16:29:49] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:32:46] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1270962 (https://phabricator.wikimedia.org/T423293) (owner: Papaul)
[16:34:04] <wikibugs>	 SRE, SRE-Unowned, Incident Severity 1, Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11820786 (MLechvien-WMF)
[16:34:05] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply
[16:34:16] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:47] <wikibugs>	 SRE, SRE-Unowned, Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11820799 (MLechvien-WMF)
[16:34:48] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[16:35:00] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: apply
[16:35:15] <logmsgbot>	 !log daniel@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: apply
[16:38:22] <wikibugs>	 (CR) Ladsgroup: [C:+1] "Ack" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[16:38:25] <wikibugs>	 (Abandoned) JHathaway: firewall: add cloud services [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[16:39:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P90687 and previous config saved to /var/cache/conftool/dbconfig/20260414-163953-fceratto.json
[16:42:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11820831 (Rmaung) Thanks @Aklapper, I'll add this to our docs now. :)
[16:42:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:43:19] <wikibugs>	 (Abandoned) JHathaway: mailman: send web posting through Spamassassin [puppet] - https://gerrit.wikimedia.org/r/1249415 (https://phabricator.wikimedia.org/T386559) (owner: JHathaway)
[16:43:23] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421398
[16:43:27] <stashbot>	 T421398: SystemdUnitFailed - zuul-executor - https://phabricator.wikimedia.org/T421398
[16:43:36] <wikibugs>	 (Abandoned) JHathaway: firewall: add to role::wmcs::instance, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226371 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[16:43:49] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul2001.codfw.wmnet with reason: T421398
[16:44:05] <wikibugs>	 (Abandoned) JHathaway: firewall: remove includes [puppet] - https://gerrit.wikimedia.org/r/1213590 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[16:44:31] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps vendordata: [puppet] - https://gerrit.wikimedia.org/r/1271000 (https://phabricator.wikimedia.org/T422509)
[16:44:54] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloud-vps vendordata: update apt preferences [puppet] - https://gerrit.wikimedia.org/r/1270975 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[16:45:30] <wikibugs>	 (PS2) Andrew Bogott: cloud-vps vendordata: force puppet install during image creation [puppet] - https://gerrit.wikimedia.org/r/1271000 (https://phabricator.wikimedia.org/T422509)
[16:45:58] <wikibugs>	 (Abandoned) JHathaway: backup1012: add to legacy slugs [cookbooks] - https://gerrit.wikimedia.org/r/1193229 (owner: JHathaway)
[16:46:19] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloud-vps vendordata: force puppet install during image creation [puppet] - https://gerrit.wikimedia.org/r/1271000 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[16:46:51] <wikibugs>	 (Abandoned) JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - https://gerrit.wikimedia.org/r/1135115 (owner: JHathaway)
[16:50:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P90688 and previous config saved to /var/cache/conftool/dbconfig/20260414-165001-fceratto.json
[16:52:18] <wikibugs>	 (CR) JHathaway: "@vgutierrez@wikimedia.org is this worth doing?" [puppet] - https://gerrit.wikimedia.org/r/1192934 (https://phabricator.wikimedia.org/T401858) (owner: JHathaway)
[16:58:06] <wikibugs>	 (PS2) JHathaway: acme-chief: remove hiera purge guard [puppet] - https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858)
[16:58:07] <wikibugs>	 (CR) Clément Goubert: [C:+1] envoy: Add 1 retry for swift services [puppet] - https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[16:58:25] <wikibugs>	 (PS1) Catrope: Enforce 2FA requirements for phase 1 groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118)
[16:58:39] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118) (owner: Catrope)
[16:59:04] <wikibugs>	 (CR) JHathaway: "@vgutierrez@wikimedia.org please review when you have a moment" [puppet] - https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858) (owner: JHathaway)
[16:59:59] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11820884 (MatthewVernon) Same failure mode after the BIOS upgrade - post-installer boot is fine, after puppet it gets to: ` Booting from Hard drive C: GRUB ` a...
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1700)
[17:00:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90689 and previous config saved to /var/cache/conftool/dbconfig/20260414-170010-fceratto.json
[17:00:14] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[17:00:28] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance
[17:02:10] <wikibugs>	 (CR) Chlod Alejandro: Update wikimania wordmark for 2026 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: Robertsky)
[17:03:13] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps vendordata: typo fix in apt line [puppet] - https://gerrit.wikimedia.org/r/1271002 (https://phabricator.wikimedia.org/T422509)
[17:03:56] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloud-vps vendordata: typo fix in apt line [puppet] - https://gerrit.wikimedia.org/r/1271002 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[17:07:37] <taavi>	 !log updating caprica hostlists on cloud-hosts-in cr firewall policies
[17:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:39] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2203.codfw.wmnet with reason: Maintenance
[17:12:46] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90690 and previous config saved to /var/cache/conftool/dbconfig/20260414-171246-fceratto.json
[17:12:50] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[17:17:38] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_eqiad - 9.2.13 Upgrade ()
[17:17:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_eqiad - 9.2.13 Upgrade ()
[17:19:56] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11820959 (TheDJ)
[17:28:39] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90691 and previous config saved to /var/cache/conftool/dbconfig/20260414-172838-fceratto.json
[17:28:42] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[17:30:01] <wikibugs>	 (PS1) Dzahn: jenkins: allow disabling jenkins even on the manager host [puppet] - https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109)
[17:38:31] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[17:38:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P90692 and previous config saved to /var/cache/conftool/dbconfig/20260414-173846-fceratto.json
[17:39:08] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:39:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:41:10] <wikibugs>	 (PS1) Ladsgroup: src: Fix typos [mediawiki-config] - https://gerrit.wikimedia.org/r/1271018
[17:41:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:41:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:44:45] <wikibugs>	 (PS2) Ladsgroup: envoy: Add 1 retry for swift services [puppet] - https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872)
[17:44:51] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] envoy: Add 1 retry for swift services [puppet] - https://gerrit.wikimedia.org/r/1270931 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[17:44:59] <wikibugs>	 (PS1) Dzahn: lists: notify apache2 service when config changes [puppet] - https://gerrit.wikimedia.org/r/1271019 (https://phabricator.wikimedia.org/T323208)
[17:45:23] <Amir1>	 jouncebot: nowandnext
[17:45:23] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1700)
[17:45:23] <jouncebot>	 In 0 hour(s) and 14 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1800)
[17:46:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:47:08] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:47:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:48:28] <wikibugs>	 (CR) RLazarus: service::catalog: add sophroid service catalog entry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine)
[17:48:55] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P90693 and previous config saved to /var/cache/conftool/dbconfig/20260414-174854-fceratto.json
[17:49:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:49:39] <jinxer-wm>	 FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:51:46] <wikibugs>	 (CR) Dzahn: "I suggest to keep this simple and just do this instead: https://gerrit.wikimedia.org/r/1271019" [puppet] - https://gerrit.wikimedia.org/r/1270921 (https://phabricator.wikimedia.org/T323208) (owner: Arnaudb)
[17:51:49] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2068.codfw.wmnet with OS bullseye
[17:51:56] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11821132 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye execu...
[17:54:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1271018 (owner: Ladsgroup)
[17:54:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:55:51] <wikibugs>	 (Merged) jenkins-bot: src: Fix typos [mediawiki-config] - https://gerrit.wikimedia.org/r/1271018 (owner: Ladsgroup)
[17:56:02] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_eqiad - 9.2.13 Upgrade ()
[17:56:16] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1271018|src: Fix typos]]
[17:58:05] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1271018|src: Fix typos]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:59:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90694 and previous config saved to /var/cache/conftool/dbconfig/20260414-175902-fceratto.json
[17:59:04] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[17:59:06] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[17:59:20] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance
[17:59:28] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T419635)', diff saved to https://phabricator.wikimedia.org/P90695 and previous config saved to /var/cache/conftool/dbconfig/20260414-175927-fceratto.json
[18:00:05] <jouncebot>	 dduvall and dancy: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T1800).
[18:00:14] <dancy>	 o/
[18:00:38] <dancy>	 Amir1: Lemme know when you're clear
[18:00:53] <Amir1>	 Almost there
[18:00:55] <Amir1>	 sorry
[18:02:44] <dancy>	 No prob
[18:03:29] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271018|src: Fix typos]] (duration: 07m 13s)
[18:03:32] <Amir1>	 done
[18:03:40] <Amir1>	 dancy: ^ sorry for the wait
[18:04:29] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271021 (https://phabricator.wikimedia.org/T420482)
[18:04:29] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_eqiad - 9.2.13 Upgrade ()
[18:04:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271021 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot)
[18:05:23] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271021 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot)
[18:08:26] <wikibugs>	 (03PS1) 10JHathaway: jhathaway: remove non-fido ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1271025
[18:11:00] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.24  refs T420482
[18:11:04] <stashbot>	 T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482
[18:12:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11821236 (10andrea.denisse) 05Open→03In progress a:03andrea.denisse
[18:12:03] <wikibugs>	 (03PS2) 10Dzahn: lists: notify apache2 service when config changes [puppet] - 10https://gerrit.wikimedia.org/r/1271019 (https://phabricator.wikimedia.org/T323208)
[18:14:16] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T419635)', diff saved to https://phabricator.wikimedia.org/P90696 and previous config saved to /var/cache/conftool/dbconfig/20260414-181416-fceratto.json
[18:14:20] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[18:15:26] <icinga-wm>	 PROBLEM - Host ms-be2068 is DOWN: PING CRITICAL - Packet loss = 100%
[18:16:33] <wikibugs>	 (03PS3) 10Jforrester: Make abstractwiki a multi-lingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420)
[18:17:02] <wikibugs>	 (03CR) 10Jforrester: "Plan is to do this tomorrow, once Wikidata has wmf.24 with the new messages." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) (owner: 10Jforrester)
[18:17:54] <wikibugs>	 (03Abandoned) 10Jforrester: mc: Shift the Wikifunctions MC route from /local/wf/ to /<dc>/wf-wan/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247687 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester)
[18:22:16] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Enforce 2FA requirements for phase 1 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118) (owner: 10Catrope)
[18:24:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P90697 and previous config saved to /var/cache/conftool/dbconfig/20260414-182424-fceratto.json
[18:26:59] <wikibugs>	 (03PS1) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948)
[18:27:27] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps vendordata: slightly more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1271029 (https://phabricator.wikimedia.org/T422509)
[18:28:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloud-vps vendordata: slightly more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1271029 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott)
[18:28:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis)
[18:29:03] <wikibugs>	 (03PS5) 10JHathaway: sysctls: add optional module param to sysctl::parameters [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726)
[18:29:55] <wikibugs>	 (03PS2) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948)
[18:30:03] <wikibugs>	 (03CR) 10JHathaway: sysctls: add optional module param to sysctl::parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway)
[18:31:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis)
[18:32:47] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway)
[18:34:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P90698 and previous config saved to /var/cache/conftool/dbconfig/20260414-183432-fceratto.json
[18:35:27] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Add my FIDO backup key [puppet] - 10https://gerrit.wikimedia.org/r/1270962 (https://phabricator.wikimedia.org/T423293) (owner: 10Papaul)
[18:36:22] <wikibugs>	 (03PS3) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948)
[18:36:43] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis)
[18:40:20] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans)
[18:40:57] <wikibugs>	 (03CR) 10CDanis: fundraising_data_import maintenance script wrapper & timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis)
[18:41:01] <wikibugs>	 (03PS10) 10Eevans: aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830)
[18:41:01] <wikibugs>	 (03PS10) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830)
[18:41:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1271025 (owner: 10JHathaway)
[18:41:46] <wikibugs>	 (03CR) 10CDanis: "Proof-of-concept -- PTAL :)" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis)
[18:42:10] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] jhathaway: remove non-fido ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1271025 (owner: 10JHathaway)
[18:42:55] <wikibugs>	 (03PS1) 10C. Scott Ananian: ParsoidLanguageConverter: convert inside <indicator> [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961)
[18:43:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961) (owner: 10C. Scott Ananian)
[18:44:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T419635)', diff saved to https://phabricator.wikimedia.org/P90699 and previous config saved to /var/cache/conftool/dbconfig/20260414-184440-fceratto.json
[18:44:44] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[18:48:31] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:52:04] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans)
[18:54:18] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11821375 (10Rmaung) Adding my approval as Paris' supervisor if that is needed. Paris will need level 3 access with Kerberos.
[19:05:46] <wikibugs>	 (03PS1) 10Dzahn: integration: switch integration-agent-docker VMs to Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109)
[19:10:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: 10Bvibber)
[19:11:16] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] reposync: don't enforce ownership after init [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway)
[19:13:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:14:12] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:16:01] <swfrench-wmf>	 FYI, I'll be applying some pending external-services network policy diffs to wikikube clusters
[19:16:35] <logmsgbot>	 !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1026.eqiad.wmnet with reason: Bootstrapping — T412830
[19:16:39] <stashbot>	 T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830
[19:19:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:19:21] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:19:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:20:11] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:21:07] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:21:25] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[19:22:04] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[19:23:39] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[19:24:27] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[19:26:07] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:27:30] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[19:27:58] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[19:28:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11821462 (10andrea.denisse)
[19:30:38] <swfrench-wmf>	 !log applied external-services network policy updates for cassandra-analytics-query-service-storage-[ab]-eqiad (aqs1026) and dumps-wikimedia in wikikube clusters
[19:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:09] <wikibugs>	 (03CR) 10Dzahn: gerrit: update sync-instances cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb)
[19:33:54] <wikibugs>	 (03CR) 10Dzahn: "Since we do this change at runtime with a curl command we might not need this change at all - unless it's to be permanent." [puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[19:37:40] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] gerrit: migrate gerrit_site away from root partition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb)
[19:38:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:40:43] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "cloud-vps vendordata: slightly more cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1271034
[19:41:09] <wikibugs>	 (03CR) 10Dzahn: "not possible to compile changes here? Hosts that were skipped (fail fast)" [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn)
[19:41:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295) (owner: 10Sbisson)
[19:41:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Revert "cloud-vps vendordata: slightly more cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1271034 (owner: 10Andrew Bogott)
[19:49:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:49:12] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:49:33] <wikibugs>	 (03CR) 10Eevans: [C:03+1] "@swfrench@wikimedia.org agreed!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: 10Scott French)
[19:50:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:50:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:53:26] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:54:06] <wikibugs>	 (03PS2) 10Dzahn: jenkins: allow disabling jenkins even on the manager host [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109)
[19:55:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:55:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T2000).
[20:00:05] <jouncebot>	 maryum, Robertsky, RoanKattouw, cscott, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:26] <maryum>	 I'm planning to deploy with spiderpig
[20:00:46] <wikibugs>	 (03PS1) 10C. Scott Ananian: LanguageConverter: Allow disabling top-level variant "guess" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328)
[20:00:50] <cscott>	 o/
[20:00:56] <cscott>	 i'm also going to spiderpig it
[20:00:57] <robertsky>	 o/
[20:01:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328) (owner: 10C. Scott Ananian)
[20:01:22] <robertsky>	 standing by for someone to deploy mine. :)
[20:01:56] <RoanKattouw>	 I can deploy my own patch and robertsky's and whoever else needs me to deploy theirs (but after cscott has gone)
[20:02:02] <bvibber>	 o/
[20:02:18] <bvibber>	 i can spiderpig my config patch (or folks can bundle it with other config patches)
[20:02:31] <cscott>	 maryum why don't you get started?  config patch should be fast.
[20:02:38] <maryum>	 yes doing it now
[20:02:46] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1271017/8416/contint1002.wikimedia.org/change.contint1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn)
[20:03:04] <cscott>	 then i guess RoanKattouw is suggesting i should go next, and then he'll do his own patch and robertsky 's.
[20:03:08] <wikibugs>	 (03PS3) 10Dzahn: jenkins: allow disabling jenkins even on the manager host [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109)
[20:04:33] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11821545 (10andrea.denisse)
[20:05:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles)
[20:05:37] <maryum>	 spiderpig is in progress
[20:06:34] <bvibber>	 can i just say again, as someone who's been intimately or tangentially involved with mediawiki deploys for 24 years, that spiderpig is *such* a wonderful democratizing tool <3
[20:06:47] <RoanKattouw>	 +100, SpiderPig is amazing
[20:06:49] <wikibugs>	 (03Merged) 10jenkins-bot: Route email confirmation funnel through Test Kitchen experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles)
[20:06:57] <cscott>	 <3 the pig
[20:07:18] <logmsgbot>	 !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1270571|Route email confirmation funnel through Test Kitchen experiment (T420007)]]
[20:07:22] <stashbot>	 T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007
[20:07:22] <bvibber>	 mm now i wanna get out my studio ghibli collection and watch porco rosso again
[20:09:08] <logmsgbot>	 !log mstyles@deploy1003 mstyles: Backport for [[gerrit:1270571|Route email confirmation funnel through Test Kitchen experiment (T420007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:12:52] <logmsgbot>	 !log mstyles@deploy1003 mstyles: Continuing with sync
[20:16:29] <wikibugs>	 (03PS1) 10Dzahn: gerrit: allow zuul machines to port 22 ssh (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1271042
[20:16:43] <logmsgbot>	 !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270571|Route email confirmation funnel through Test Kitchen experiment (T420007)]] (duration: 09m 25s)
[20:16:47] <stashbot>	 T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007
[20:17:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: allow zuul machines to port 22 ssh (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (owner: 10Dzahn)
[20:17:16] <maryum>	 cscott I think you can go now
[20:17:39] <cscott>	 cool, thanks!
[20:17:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961) (owner: 10C. Scott Ananian)
[20:17:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328) (owner: 10C. Scott Ananian)
[20:21:50] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "it does what it is intended to do - the parameter naming could be confusing though - https://puppet-compiler.wmflabs.org/output/1271017/84" [puppet] - 10https://gerrit.wikimedia.org/r/1271017 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn)
[20:22:30] <wikibugs>	 (03Merged) 10jenkins-bot: ParsoidLanguageConverter: convert inside <indicator> [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271030 (https://phabricator.wikimedia.org/T422961) (owner: 10C. Scott Ananian)
[20:29:48] <wikibugs>	 (03Merged) 10jenkins-bot: LanguageConverter: Allow disabling top-level variant "guess" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271038 (https://phabricator.wikimedia.org/T419328) (owner: 10C. Scott Ananian)
[20:30:12] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside <indicator> (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]]
[20:30:18] <stashbot>	 T422961: LanguageConverter doesn't convert inside <indicator> - https://phabricator.wikimedia.org/T422961
[20:30:18] <stashbot>	 T419328: Legacy LanguageConverter uses top-level ::guessVariant on srwiki - https://phabricator.wikimedia.org/T419328
[20:32:00] <logmsgbot>	 !log cscott@deploy1003 cscott: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside <indicator> (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:32:32] <wikibugs>	 (03PS4) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948)
[20:36:42] <logmsgbot>	 !log cscott@deploy1003 cscott: Continuing with sync
[20:40:31] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside <indicator> (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]] (duration: 10m 18s)
[20:40:35] <stashbot>	 T422961: LanguageConverter doesn't convert inside <indicator> - https://phabricator.wikimedia.org/T422961
[20:40:36] <stashbot>	 T419328: Legacy LanguageConverter uses top-level ::guessVariant on srwiki - https://phabricator.wikimedia.org/T419328
[20:40:40] <cscott>	 ok, over to you RoanKattouw 
[20:42:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: 10Robertsky)
[20:42:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118) (owner: 10Catrope)
[20:44:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update wikimaniawiki namespace search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: 10Robertsky)
[20:45:28] <wikibugs>	 (03CR) 10Catrope: [C:03+2] Update wikimaniawiki namespace search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: 10Robertsky)
[20:46:05] <cscott>	 it's always fun when castor-save-workspace-cache fails.
[20:46:32] <wikibugs>	 (03Merged) 10jenkins-bot: Enforce 2FA requirements for phase 1 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271001 (https://phabricator.wikimedia.org/T423118) (owner: 10Catrope)
[20:47:26] <A_smart_kitten>	 that feels like either T419488 or T409479 (depending on whether one's a duplicate of the other), and/or maybe some other task I'm not aware of
[20:47:26] <stashbot>	 T419488: PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488
[20:47:27] <stashbot>	 T409479: quibble-with-gated-extensions-vendor-mysql-php81 failure due to postbuild jobs - https://phabricator.wikimedia.org/T409479
[20:47:36] <wikibugs>	 (03Merged) 10jenkins-bot: Update wikimaniawiki namespace search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270933 (https://phabricator.wikimedia.org/T423278) (owner: 10Robertsky)
[20:49:30] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1270933|Update wikimaniawiki namespace search (T423278)]], [[gerrit:1271001|Enforce 2FA requirements for phase 1 groups (T423118)]]
[20:49:35] <stashbot>	 T423278: update search namespace to 2026, 2027 for wikimaniawiki - https://phabricator.wikimedia.org/T423278
[20:49:36] <stashbot>	 T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[20:51:22] <logmsgbot>	 !log catrope@deploy1003 catrope, robertsky: Backport for [[gerrit:1270933|Update wikimaniawiki namespace search (T423278)]], [[gerrit:1271001|Enforce 2FA requirements for phase 1 groups (T423118)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:53:09] <logmsgbot>	 !log catrope@deploy1003 catrope, robertsky: Continuing with sync
[20:53:10] <robertsky>	 verified.
[20:56:58] <logmsgbot>	 !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270933|Update wikimaniawiki namespace search (T423278)]], [[gerrit:1271001|Enforce 2FA requirements for phase 1 groups (T423118)]] (duration: 07m 28s)
[20:57:03] <stashbot>	 T423278: update search namespace to 2026, 2027 for wikimaniawiki - https://phabricator.wikimedia.org/T423278
[20:57:03] <stashbot>	 T423118: FY25-26 Q4: Phase 1 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423118
[20:57:10] <RoanKattouw>	 Alright I'm done
[20:57:21] <RoanKattouw>	 bvibber: You're free to do yours now
[20:58:45] <robertsky>	 thank you for the assistance!
[20:59:46] <bvibber>	 whee
[21:00:05] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260414T2100)
[21:01:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: 10Bvibber)
[21:01:16] <bvibber>	 it begins
[21:01:25] <bvibber>	 shouldn't take long, it's just config <3
[21:02:28] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ReaderExperiments for itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: 10Bvibber)
[21:02:51] <logmsgbot>	 !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1270592|Enable ReaderExperiments for itwiki, plwiki (T423173)]]
[21:02:55] <stashbot>	 T423173: Mobile Page Previews: Launch the experiment - https://phabricator.wikimedia.org/T423173
[21:04:41] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1270592|Enable ReaderExperiments for itwiki, plwiki (T423173)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:08:20] <bvibber>	 something's not right
[21:08:48] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Continuing with sync
[21:08:53] <bvibber>	 nope it's right!
[21:08:57] <bvibber>	 i was testing the wrong thing lol
[21:12:39] <logmsgbot>	 !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270592|Enable ReaderExperiments for itwiki, plwiki (T423173)]] (duration: 09m 48s)
[21:12:43] <stashbot>	 T423173: Mobile Page Previews: Launch the experiment - https://phabricator.wikimedia.org/T423173
[21:14:06] <icinga-wm>	 PROBLEM - SSH on wikikube-worker2280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:14:18] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11821904 (10andrea.denisse)
[21:16:13] <bvibber>	 oh yay mine's all done :D
[21:16:20] <bvibber>	 anybody left or are we on to the next window :D
[21:18:03] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11821907 (10andrea.denisse)
[21:18:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2280.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2280.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:22:14] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to <analytics-privatedata-users> for <Passimacopoulos> - https://phabricator.wikimedia.org/T423301#11821928 (andrea.denisse) Hi @Passimacopoulos, while I work on your request I need to share with you the [[ https://wikitech.wikimedia.org/w...
[21:24:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2280:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2280 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:39:18] <icinga-wm>	 PROBLEM - Host wikikube-worker2280 is DOWN: PING CRITICAL - Packet loss = 100%
[21:39:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2280:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2280 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:41:48] <wikibugs>	 (CR) Jasmine: [C:+2] Add wikikube-ctrl2004 and wikikube-ctrl2005 to the etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1218351 (https://phabricator.wikimedia.org/T390861) (owner: Jasmine)
[22:43:16] <logmsgbot>	 !log jasmine@dns1004 START - running authdns-update
[22:44:44] <logmsgbot>	 !log jasmine@dns1004 END - running authdns-update
[23:09:03] <wikibugs>	 (Abandoned) Andrew Bogott: nova vendordata: disable unattended upgrades in base image [puppet] - https://gerrit.wikimedia.org/r/1269056 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[23:11:28] <Amir1>	 !log optimizing globalblocks table on s7 (T423349)
[23:11:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:31] <stashbot>	 T423349: globalblocks query randomly becomes slow - https://phabricator.wikimedia.org/T423349
[23:14:15] <jinxer-wm>	 FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 13.76% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:16:07] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:19:15] <jinxer-wm>	 RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 13.2% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:36:01] <wikibugs>	 (PS1) Eevans: linked-artifacts: update staging to v1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1271083 (https://phabricator.wikimedia.org/T414838)
[23:36:03] <wikibugs>	 (PS2) Cwhite: beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516)
[23:39:13] <wikibugs>	 (PS1) Ladsgroup: Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637)
[23:39:26] <wikibugs>	 (PS1) Ladsgroup: Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1271086 (https://phabricator.wikimedia.org/T412637)
[23:39:35] <wikibugs>	 (PS3) Cwhite: beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516)
[23:39:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup)
[23:39:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1271086 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup)
[23:39:45] <wikibugs>	 (PS4) Cwhite: beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516)
[23:39:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1271088
[23:39:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1271088 (owner: TrainBranchBot)
[23:41:07] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:45:09] <wikibugs>	 (CR) Cwhite: [C:+2] beta-logs: provision ca on cluster hosts [puppet] - https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[23:54:11] <wikibugs>	 (Merged) jenkins-bot: Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1271086 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup)
[23:55:25] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:55:27] <wikibugs>	 (CR) CI reject: [V:-1] Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup)
[23:55:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:57:08] <wikibugs>	 (CR) Ladsgroup: "try again" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup)
[23:57:30] <wikibugs>	 (CR) Ladsgroup: [C:+2] Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: Ladsgroup)
[23:58:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1271088 (owner: TrainBranchBot)