[00:03:56] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:55] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:11:50] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:16:50] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:18:51] <jinxer-wm>	 FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[00:19:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:33] <denisse>	 !imm
[00:19:41] <denisse>	 !incidents
[00:19:42] <sirenbot>	 8035 (ACKED)  [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw)
[00:19:42] <sirenbot>	 8034 (RESOLVED)  OutboundMXQueueHigh sre (mx-out1001:9154 eqiad)
[00:23:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:24:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:27:38] <icinga-wm>	 PROBLEM - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:27:39] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T427748 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:27:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748 (ops-monitoring-bot) NEW
[00:53:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:58:45] <jinxer-wm>	 RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:58:51] <jinxer-wm>	 FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:03:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[01:03:42] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[01:09:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:09:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1295788
[01:09:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1295788 (owner: TrainBranchBot)
[01:10:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:10:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:11:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:11:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:13:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ...
[01:13:51] <jinxer-wm>	 IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:18:23] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1295788 (owner: TrainBranchBot)
[01:20:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[01:20:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:22:02] <denisse>	 !incidents
[01:22:03] <sirenbot>	 8036 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[01:22:03] <sirenbot>	 8035 (RESOLVED)  [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw)
[01:22:03] <sirenbot>	 8034 (RESOLVED)  OutboundMXQueueHigh sre (mx-out1001:9154 eqiad)
[01:25:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[01:25:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:36:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[01:36:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:37:07] <denisse>	 !ack 8037
[01:37:07] <sirenbot>	 8037 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[01:55:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11970439 (Jclark-ctr) a:Jclark-ctr This server is out of warranty; will check to see what is available from decom servers
[01:56:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ...
[01:56:51] <jinxer-wm>	 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[02:08:56] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:56] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:13:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:13:55] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[04:05:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:24:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:09:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:47:52] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1028 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[05:48:52] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti1028 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[06:44:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Mark the wikidough ports as intentionally open to the world [puppet] - https://gerrit.wikimedia.org/r/1295431 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[06:47:13] <wikibugs>	 (CR) Slyngshede: [C:+1] "That's a lot of groups" [puppet] - https://gerrit.wikimedia.org/r/1295467 (owner: Muehlenhoff)
[06:47:51] <wikibugs>	 SRE, observability, SRE Observability: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469#11970599 (Marostegui) There seem to be way more alerts with this problem {F85870686}
[06:47:58] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[06:48:34] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[07:00:05] <jouncebot>	 Amir1, urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T0700).
[07:00:05] <jouncebot>	 WMDE-Fisch, atsukoito, and xxb: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] <WMDE-Fisch>	 \o
[07:00:21] <xxb>	 nyaa
[07:00:30] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11970613 (MoritzMuehlenhoff)
[07:00:51] <WMDE-Fisch>	 I'll self serve and start with my stuff
[07:01:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: WMDE-Fisch)
[07:01:17] <wikibugs>	 (CR) Brouberol: [C:+1] "We will need to redeploy everything using these fields. Could you ping me when this is merged, I'll redeploy all kerberized kubernetes app" [puppet] - https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: Elukey)
[07:01:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch the pki:root role to nftables [puppet] - https://gerrit.wikimedia.org/r/1294958 (https://phabricator.wikimedia.org/T416664) (owner: Muehlenhoff)
[07:06:01] <WMDE-Fisch>	 xxb: I see on the ticket to your patch that there's a comment to wait at least a week before you should come to a conclusion. 🤔
[07:06:59] <xxb>	 i mean 16 support 0 oppose. i can wait but fine.
[07:07:15] <WMDE-Fisch>	 True ;-)
[07:07:27] <WMDE-Fisch>	 I just saw that as well.
[07:10:18] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch rpkivalidator role to nftables [puppet] - https://gerrit.wikimedia.org/r/1294930 (owner: Muehlenhoff)
[07:10:46] <icinga-wm>	 PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100%
[07:13:01] <wikibugs>	 (Merged) jenkins-bot: Update VE core submodule to master (9cf5524e7) [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: WMDE-Fisch)
[07:13:07] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:13:14] <icinga-wm>	 RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[07:13:38] <logmsgbot>	 !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1294826|Update VE core submodule to master (9cf5524e7) (T424232)]]
[07:13:41] <stashbot>	 T424232: VisualDiff does not show change of a main+details edit - https://phabricator.wikimedia.org/T424232
[07:13:55] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:14:46] <wikibugs>	 (PS1) Brouberol: test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570)
[07:19:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[07:23:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[07:25:00] <WMDE-Fisch>	 Hmmm building the containers takes quite long....
[07:25:33] <WMDE-Fisch>	 But I also don't see anything in the logs.
[07:26:23] <wikibugs>	 (CR) Fabfur: [C:+1] cache::haproxy: limit email addresses to reasonable lengths [puppet] - https://gerrit.wikimedia.org/r/1240174 (owner: Giuseppe Lavagetto)
[07:27:14] <WMDE-Fisch>	 Ah now it works ^^'
[07:27:59] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11970653 (ayounsi) `--move-vlan` is only made to migrate core DCs from legacy to new per rack vlans. Let me know if it's worth spending...
[07:28:25] <wikibugs>	 (CR) Fabfur: "Will this require a general haproxykafka roll-restart?" [puppet] - https://gerrit.wikimedia.org/r/1295020 (owner: Elukey)
[07:28:56] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:31:22] <logmsgbot>	 !log wmde-fisch@deploy1003 wmde-fisch: Backport for [[gerrit:1294826|Update VE core submodule to master (9cf5524e7) (T424232)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:31:26] <stashbot>	 T424232: VisualDiff does not show change of a main+details edit - https://phabricator.wikimedia.org/T424232
[07:31:37] <WMDE-Fisch>	 Testing
[07:32:06] <logmsgbot>	 !log wmde-fisch@deploy1003 wmde-fisch: Continuing with deployment
[07:34:31] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1295368 (https://phabricator.wikimedia.org/T426764) (owner: Brouberol)
[07:35:27] <wikibugs>	 (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: enable sync pods to egress to our s3 endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/1295368 (https://phabricator.wikimedia.org/T426764) (owner: Brouberol)
[07:38:43] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:38:49] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:38:56] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:40:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:40:48] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:41:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad+%28gre%29 - https://alerts.wikimedia.org/?q=alertname%3DCoreB
[07:41:57] <WMDE-Fisch>	 Deployment is somehow soooo slow today....
[07:42:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:42:42] <xxb>	 it's monday morning for them too ¯\_(ツ)_/¯
[07:42:56] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:43:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11970702 (MatthewVernon) @VRiley-WMF sure; backends can't be meaningfully depooled, so it'd be a case of "do one, check everything has recovered OK, move on to the next". [not sure if it's ea...
[07:44:30] <WMDE-Fisch>	 xxb: I fear we won't make your change ... there's this other config patch I need to merge still :-/
[07:44:46] <icinga-wm>	 RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:45:00] <WMDE-Fisch>	 You could put it into the afternoon slot though.
[07:45:11] <xxb>	 sure i'll try to get it this evening or tomorrow
[07:45:13] <logmsgbot>	 !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294826|Update VE core submodule to master (9cf5524e7) (T424232)]] (duration: 31m 34s)
[07:45:16] <stashbot>	 T424232: VisualDiff does not show change of a main+details edit - https://phabricator.wikimedia.org/T424232
[07:45:18] <wikibugs>	 (CR) Santiago Faci: test-kitchen: reach out to the growthbook-api through the mesh (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: Brouberol)
[07:45:23] <xxb>	 afternoon i'll have to do other irl stuff ;/
[07:45:49] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295454 (https://phabricator.wikimedia.org/T427484) (owner: Svantje Lilienthal)
[07:46:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: ...
[07:46:39] <jinxer-wm>	 Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad+%28gre%29 - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:47:03] <wikibugs>	 (PS2) Muehlenhoff: mirrors: Disable osbpo sync [puppet] - https://gerrit.wikimedia.org/r/1294980 (https://phabricator.wikimedia.org/T416707)
[07:47:14] <wikibugs>	 (Merged) jenkins-bot: Disable the creation of synthetic main refs in production [mediawiki-config] - https://gerrit.wikimedia.org/r/1295454 (https://phabricator.wikimedia.org/T427484) (owner: Svantje Lilienthal)
[07:47:31] <logmsgbot>	 !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1295454|Disable the creation of synthetic main refs in production (T427484)]]
[07:47:34] <stashbot>	 T427484: Disable the creation of synthetic main refs in production - https://phabricator.wikimedia.org/T427484
[07:49:43] <dcausse>	 atsukoito: seems like there won't be enough time left for the ttm config change this morning :/
[07:50:05] <dcausse>	 jouncebot: next
[07:50:06] <jouncebot>	 In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1000)
[07:50:40] <dcausse>	 well if it's OK we could possibly extend the backport window?
[07:51:06] <atsukoito>	 dcausse: let's move it further then
[07:51:18] <atsukoito>	 tuesday?
[07:51:22] <logmsgbot>	 !log wmde-fisch@deploy1003 lilients, wmde-fisch: Backport for [[gerrit:1295454|Disable the creation of synthetic main refs in production (T427484)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:51:35] <dcausse>	 atsukoito: sounds good, tuesday same time
[07:51:58] <atsukoito>	 I'll update the patch/page, thanks
[07:52:03] <dcausse>	 thanks!
[07:52:34] <logmsgbot>	 !log wmde-fisch@deploy1003 lilients, wmde-fisch: Continuing with deployment
[07:52:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[07:54:19] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Bitu: Switch to idm-sre-approval@wikimedia.org for notifications [puppet] - https://gerrit.wikimedia.org/r/1295467 (owner: Muehlenhoff)
[07:56:13] <XioNoX>	 !log add no_p2p term to pfw1-codfw  BGP_fundraising_export - T423384
[07:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:17] <stashbot>	 T423384: Investigate internal rejected prefixes - https://phabricator.wikimedia.org/T423384
[07:57:58] <wikibugs>	 (PS2) Brouberol: test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570)
[07:58:37] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295531 (https://phabricator.wikimedia.org/T427384) (owner: XXBlackburnXx)
[07:58:58] <logmsgbot>	 !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295454|Disable the creation of synthetic main refs in production (T427484)]] (duration: 11m 26s)
[07:59:01] <stashbot>	 T427484: Disable the creation of synthetic main refs in production - https://phabricator.wikimedia.org/T427484
[07:59:31] <WMDE-Fisch>	 Deployments done.
[08:00:11] <wikibugs>	 (CR) Brouberol: test-kitchen: reach out to the growthbook-api through the mesh (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: Brouberol)
[08:05:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:20] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Raise SiteVerify error threshold to 100 [mediawiki-config] - https://gerrit.wikimedia.org/r/1295802
[08:08:24] <wikibugs>	 (CR) Santiago Faci: [C:+2] test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: Brouberol)
[08:11:45] <wikibugs>	 (Merged) jenkins-bot: test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: Brouberol)
[08:12:36] <wikibugs>	 SRE, SRE-swift-storage: tests for wmf/rewrite.py should use standard thumbnail size (and should also work) - https://phabricator.wikimedia.org/T427757 (MatthewVernon) NEW
[08:13:31] <wikibugs>	 (CR) Urbanecm: [C:+1] "No objection, although...did it really not work the whole time?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295502 (owner: Bartosz Dziewoński)
[08:13:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:13:48] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:16:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: ...
[08:16:39] <jinxer-wm>	 Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad+%28gre%29 - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:17:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:18:42] <wikibugs>	 (PS1) MVernon: rewrite_integration: use a standard thumbnail size [puppet] - https://gerrit.wikimedia.org/r/1295804 (https://phabricator.wikimedia.org/T427757)
[08:20:20] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: tests for wmf/rewrite.py should use standard thumbnail size (and should also work) - https://phabricator.wikimedia.org/T427757#11970801 (MatthewVernon)
[08:20:24] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11970802 (MatthewVernon)
[08:24:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:24:35] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[08:24:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93405 and previous config saved to /var/cache/conftool/dbconfig/20260601-082442-fceratto.json
[08:31:31] <wikibugs>	 (PS1) Ayounsi: Add RejectingBGPPrefixes alert [alerts] - https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384)
[08:31:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93406 and previous config saved to /var/cache/conftool/dbconfig/20260601-083146-fceratto.json
[08:32:08] <wikibugs>	 (CR) JMeybohm: ratelimit-media: policy and user-class level metrics (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[08:33:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 380533912 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:33:55] <wikibugs>	 (CR) CI reject: [V:-1] Add RejectingBGPPrefixes alert [alerts] - https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384) (owner: Ayounsi)
[08:34:38] <wikibugs>	 (PS1) Muehlenhoff: node20-slim: Fix image build [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295847
[08:34:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2744880 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:35:14] <wikibugs>	 (CR) JMeybohm: [C:+1] kafka-main2010: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T427088) (owner: Jasmine)
[08:35:25] <wikibugs>	 (PS2) Ayounsi: Add RejectingBGPPrefixes alert [alerts] - https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384)
[08:36:39] <wikibugs>	 (CR) JMeybohm: [C:+1] "Two nits but feel free to ignore" [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[08:36:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11970923 (mahmoud.abdelsattar.wmde) Dear @Dzahn .. I've confirmed the SSH key with my email. Thanks a lot!
[08:40:46] <icinga-wm>	 PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100%
[08:41:49] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295430 (https://phabricator.wikimedia.org/T414440) (owner: Clément Goubert)
[08:41:55] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P93407 and previous config saved to /var/cache/conftool/dbconfig/20260601-084154-fceratto.json
[08:43:14] <icinga-wm>	 RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[08:50:07] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux - https://phabricator.wikimedia.org/T423430#11971143 (ayounsi) Once this is fixed we can remove `|ibgp` from the [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/1295805 | RejectingBGPPrefixes...
[08:52:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P93408 and previous config saved to /var/cache/conftool/dbconfig/20260601-085202-fceratto.json
[08:52:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:52:26] <wikibugs>	 (PS1) Dpogorzelski: ml-serve: update eqiad kserve/knative [deployment-charts] - https://gerrit.wikimedia.org/r/1295850
[08:52:31] <wikibugs>	 (CR) FNegri: "Is T426804 a blocker for this?" [puppet] - https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191) (owner: Zabe)
[08:57:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:59:55] <wikibugs>	 (CR) CWilliams: [C:+1] sre.mysql.pool: Support depooling unreachable hosts (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) (owner: Federico Ceratto)
[09:02:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93409 and previous config saved to /var/cache/conftool/dbconfig/20260601-090209-fceratto.json
[09:02:30] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:02:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93410 and previous config saved to /var/cache/conftool/dbconfig/20260601-090237-fceratto.json
[09:04:37] <wikibugs>	 (CR) CWilliams: cookbooks/sre/mysql/decommission: add cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: Federico Ceratto)
[09:05:07] <wikibugs>	 (PS1) Jelto: miscweb: update wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1295852 (https://phabricator.wikimedia.org/T414405)
[09:08:46] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: update wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1295852 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[09:09:13] <wikibugs>	 (PS3) Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs [puppet] - https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992)
[09:09:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:11:31] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11971248 (atsuko) In progress→Resolved a:atsuko Needed to create kerberos principal that matches the uni...
[09:11:36] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[09:11:57] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1177: Upgrading db1177.eqiad.wmnet
[09:12:36] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1177: Upgrading db1177.eqiad.wmnet
[09:12:50] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:13:22] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:14:25] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[09:15:22] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[09:16:45] <wikibugs>	 (CR) Federico Ceratto: [C:+2] sre.mysql.pool: Support depooling unreachable hosts [cookbooks] - https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) (owner: Federico Ceratto)
[09:16:46] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake for Dlulisa-WMF - https://phabricator.wikimedia.org/T427197#11971265 (atsuko) Could you please re-check that you have the access to the tables if you do `kinit wmf-ldlulisa`.
[09:17:55] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1177.eqiad.wmnet with OS trixie
[09:18:27] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "LGTM, just reviewing the description and checking the CI ran successfully" [puppet] - https://gerrit.wikimedia.org/r/1295804 (https://phabricator.wikimedia.org/T427757) (owner: MVernon)
[09:20:48] <wikibugs>	 (Merged) jenkins-bot: sre.mysql.pool: Support depooling unreachable hosts [cookbooks] - https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) (owner: Federico Ceratto)
[09:24:42] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] hCaptcha: Raise SiteVerify error threshold to 100 [mediawiki-config] - https://gerrit.wikimedia.org/r/1295802 (owner: Kosta Harlan)
[09:28:49] <wikibugs>	 (PS1) Marostegui: control-mariadb-10.11-bookworm: New version [software] - https://gerrit.wikimedia.org/r/1295859 (https://phabricator.wikimedia.org/T427345)
[09:29:29] <wikibugs>	 (PS1) Atsuko: flink: updating control to jdk21 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774)
[09:29:51] <wikibugs>	 SRE, Content-Transform-Team, ServiceOps new, Wikipedia-Android-App-Backlog: Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11971334 (MLechvien-WMF) p:Medium→Low
[09:31:02] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage
[09:31:08] <wikibugs>	 (CR) Brouberol: [C:+1] flink: updating control to jdk21 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774) (owner: Atsuko)
[09:31:58] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake for Dlulisa-WMF - https://phabricator.wikimedia.org/T427197#11971338 (atsuko) In progress→Invalid Confirmed that the access is already present, no change needed.
[09:33:16] <wikibugs>	 (CR) Marostegui: [C:+2] control-mariadb-10.11-bookworm: New version [software] - https://gerrit.wikimedia.org/r/1295859 (https://phabricator.wikimedia.org/T427345) (owner: Marostegui)
[09:34:25] <wikibugs>	 SRE-tools, DBA, Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11971342 (Marostegui) @elukey any input on this?
[09:34:29] <wikibugs>	 (CR) JavierMonton: [C:+1] flink: updating control to jdk21 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774) (owner: Atsuko)
[09:34:34] <wikibugs>	 (CR) MVernon: [C:+2] rewrite_integration: use a standard thumbnail size [puppet] - https://gerrit.wikimedia.org/r/1295804 (https://phabricator.wikimedia.org/T427757) (owner: MVernon)
[09:34:34] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11971346 (Marostegui) p:Triage→Medium
[09:34:40] <wikibugs>	 (Merged) jenkins-bot: control-mariadb-10.11-bookworm: New version [software] - https://gerrit.wikimedia.org/r/1295859 (https://phabricator.wikimedia.org/T427345) (owner: Marostegui)
[09:34:43] <wikibugs>	 (CR) Atsuko: [V:+2 C:+2] flink: updating control to jdk21 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774) (owner: Atsuko)
[09:35:07] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage
[09:35:27] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11971353 (Marostegui)
[09:37:43] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[09:37:59] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: tests for wmf/rewrite.py should use standard thumbnail size (and should also work) - https://phabricator.wikimedia.org/T427757#11971355 (MatthewVernon) Open→Resolved ` mvernon@ms-fe1009:~$ python3 /usr/local/lib/python3.9/dist-packages/wmf/rewrite...
[09:38:03] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1055: Upgrading es1055.eqiad.wmnet
[09:38:23] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1055: Upgrading es1055.eqiad.wmnet
[09:39:11] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1055.eqiad.wmnet with OS trixie
[09:39:19] <wikibugs>	 (CR) JMeybohm: [C:-1] profile::kafka: remove kafka_11 profile occurrences (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1295022 (owner: Elukey)
[09:40:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:41:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:42:53] <wikibugs>	 (PS1) Muehlenhoff: proton: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1295861
[09:45:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:44] <wikibugs>	 (CR) Muehlenhoff: [C:+2] proton: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1295861 (owner: Muehlenhoff)
[09:50:34] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply
[09:51:21] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply
[09:51:39] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1177.eqiad.wmnet with OS trixie
[09:53:46] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1055.eqiad.wmnet with reason: host reimage
[09:54:40] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[09:56:05] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[09:58:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1055.eqiad.wmnet with reason: host reimage
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1000)
[10:00:29] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1177: Migration of db1177.eqiad.wmnet completed
[10:02:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93414 and previous config saved to /var/cache/conftool/dbconfig/20260601-100252-fceratto.json
[10:03:20] <wikibugs>	 (CR) Majavah: [C:+2] firewall::client: Fix default for qos [puppet] - https://gerrit.wikimedia.org/r/1294948 (owner: Majavah)
[10:07:27] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply
[10:08:23] <wikibugs>	 (CR) Marostegui: [C:+1] "It is not used." [puppet] - https://gerrit.wikimedia.org/r/1294248 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[10:09:57] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[10:13:01] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93415 and previous config saved to /var/cache/conftool/dbconfig/20260601-101300-fceratto.json
[10:13:56] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 43591296 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:14:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 154872 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:15:05] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1055.eqiad.wmnet with OS trixie
[10:15:57] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[10:16:18] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1055: repool after upgrade
[10:16:55] <wikibugs>	 (CR) JMeybohm: profile::kafka::broker: add ACLs in a file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[10:21:12] <wikibugs>	 (PS1) Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1295866
[10:23:09] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93418 and previous config saved to /var/cache/conftool/dbconfig/20260601-102308-fceratto.json
[10:25:09] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11971500 (cmooney) >>! In T427393#11970653, @ayounsi wrote: > `--move-vlan` is only made to migrate core DCs from legacy to new per rac...
[10:25:43] <wikibugs>	 (CR) Btullis: [C:+2] Create a new role for the dse-k8s nodes that are dedicated to wdqs [puppet] - https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) (owner: Btullis)
[10:31:51] <wikibugs>	 (CR) Muehlenhoff: [C:+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1295866 (owner: Muehlenhoff)
[10:32:11] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1295866 (owner: Muehlenhoff)
[10:33:16] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93421 and previous config saved to /var/cache/conftool/dbconfig/20260601-103316-fceratto.json
[10:34:00] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2003.codfw.wmnet
[10:35:40] <wikibugs>	 (CR) Muehlenhoff: [C:+2] dbproxy: Remove unused public type [puppet] - https://gerrit.wikimedia.org/r/1294248 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[10:36:02] <wikibugs>	 (PS2) Muehlenhoff: profile::mariadb::proxy: Use Puppet types [puppet] - https://gerrit.wikimedia.org/r/1294258
[10:36:55] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11971558 (MoritzMuehlenhoff)
[10:39:42] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1294258 (owner: Muehlenhoff)
[10:40:18] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2003.codfw.wmnet
[10:45:19] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:45:19] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1000)
[10:45:19] <jouncebot>	 In 2 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1300)
[10:45:26] <Dreamy_Jazz>	 Anyone mind me deploying?
[10:45:59] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1177: Migration of db1177.eqiad.wmnet completed
[10:46:00] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[10:47:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es2055 to es1 codfw primary T427032', diff saved to https://phabricator.wikimedia.org/P93424 and previous config saved to /var/cache/conftool/dbconfig/20260601-104739-marostegui.json
[10:47:43] <stashbot>	 T427032: Migrate es1 section to Debian Trixie - https://phabricator.wikimedia.org/T427032
[10:48:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1050 to es1 eqiad primary T427032', diff saved to https://phabricator.wikimedia.org/P93425 and previous config saved to /var/cache/conftool/dbconfig/20260601-104837-marostegui.json
[10:49:16] <wikibugs>	 (PS1) Marostegui: wmnet: Update es1-master CNAME [dns] - https://gerrit.wikimedia.org/r/1295872 (https://phabricator.wikimedia.org/T427032)
[10:51:13] <wikibugs>	 (CR) Majavah: [C:-1] designate: remove leftover mcrouter code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) (owner: Andrew Bogott)
[10:52:04] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update es1-master CNAME [dns] - https://gerrit.wikimedia.org/r/1295872 (https://phabricator.wikimedia.org/T427032) (owner: Marostegui)
[10:52:25] <logmsgbot>	 !log marostegui@dns1004 START - running authdns-update
[10:54:09] <logmsgbot>	 !log marostegui@dns1004 END - running authdns-update
[10:56:03] <Dreamy_Jazz>	 testing deployment...
[10:57:46] <wikibugs>	 (PS1) Muehlenhoff: thumbor: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1295873
[11:00:19] <wikibugs>	 (CR) Muehlenhoff: [C:+2] thumbor: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1295873 (owner: Muehlenhoff)
[11:01:14] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[11:01:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93427 and previous config saved to /var/cache/conftool/dbconfig/20260601-110121-fceratto.json
[11:01:43] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1055: repool after upgrade
[11:04:58] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[11:06:12] <wikibugs>	 (PS1) Btullis: Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653)
[11:08:20] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93429 and previous config saved to /var/cache/conftool/dbconfig/20260601-110820-fceratto.json
[11:08:38] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: Btullis)
[11:10:03] <wikibugs>	 (CR) Marostegui: [C:+1] profile::mariadb::proxy: Use Puppet types [puppet] - https://gerrit.wikimedia.org/r/1294258 (owner: Muehlenhoff)
[11:10:58] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[11:11:06] <wikibugs>	 (CR) Kamila Součková: [C:+1] api-gateway: Pre-teardown deprecation [deployment-charts] - https://gerrit.wikimedia.org/r/1294957 (https://phabricator.wikimedia.org/T426881) (owner: Clément Goubert)
[11:11:34] <wikibugs>	 (CR) Atsuko: [C:+1] Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: Btullis)
[11:12:09] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:13:55] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971652 (VRiley-WMF) Open→In progress
[11:14:00] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971653 (VRiley-WMF) Updating BIOS
[11:14:22] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:14:31] <wikibugs>	 (CR) Federico Ceratto: [C:+2] sre.mysql: Auto-lint imports [cookbooks] - https://gerrit.wikimedia.org/r/1293666 (https://phabricator.wikimedia.org/T419874) (owner: Federico Ceratto)
[11:14:32] <wikibugs>	 (CR) Jelto: [C:+2] sre.gitlab.upgrade: increase downtime for backup-restore.service [cookbooks] - https://gerrit.wikimedia.org/r/1295440 (https://phabricator.wikimedia.org/T427614) (owner: Jelto)
[11:15:27] <wikibugs>	 (PS2) Btullis: Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653)
[11:16:11] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: Btullis)
[11:16:43] <wikibugs>	 (CR) Kamila Součková: "LGTM, I'll merge when ready to proceed. Thank you!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295044 (owner: Scott French)
[11:17:08] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:18:28] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93430 and previous config saved to /var/cache/conftool/dbconfig/20260601-111827-fceratto.json
[11:19:11] <wikibugs>	 (Merged) jenkins-bot: sre.gitlab.upgrade: increase downtime for backup-restore.service [cookbooks] - https://gerrit.wikimedia.org/r/1295440 (https://phabricator.wikimedia.org/T427614) (owner: Jelto)
[11:19:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:00] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:21:09] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:22:37] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:22:45] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:22:51] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:22:53] <moritzm>	 !log installing imagemagick security updates
[11:22:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:58] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971669 (VRiley-WMF) BIOS is now at 1.21.1 (previous was 1.12.1). Moving onto iDRAC
[11:23:43] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:23:48] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:24:31] <wikibugs>	 (CR) Kamila Součková: [C:+1] "Thank you <3" [puppet] - https://gerrit.wikimedia.org/r/1295057 (owner: Scott French)
[11:24:39] <wikibugs>	 (PS1) Mszwarc: Add SetGlobalPreference maintenance script [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476)
[11:25:16] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:28:36] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93432 and previous config saved to /var/cache/conftool/dbconfig/20260601-112835-fceratto.json
[11:28:45] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:28:50] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:29:02] <wikibugs>	 (CR) Kamila Součková: [C:+1] "Yup, /bin/nodejs is in the package file list. Thank you!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295847 (owner: Muehlenhoff)
[11:29:49] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wdqs-internal/main on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:32:09] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:32:39] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:32:49] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:33:25] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:33:28] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:34:42] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[11:34:49] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release wdqs-internal/main on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:36:29] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[11:37:50] <moritzm>	 !log installing Exim security updates
[11:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:32] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971696 (VRiley-WMF) iDRAC has been completed. moving onto Non-expander storage backplane
[11:38:44] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93433 and previous config saved to /var/cache/conftool/dbconfig/20260601-113843-fceratto.json
[11:39:03] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[11:39:11] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93434 and previous config saved to /var/cache/conftool/dbconfig/20260601-113911-fceratto.json
[11:46:56] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971700 (VRiley-WMF) Firmware (BIOS, iDRAC and Non-expander storage backplane) have been updated (I thought they were up to date before, but new information was pointed out to me). Through iDRAC I can...
[11:49:46] <wikibugs>	 (CR) Ladsgroup: "yeah probably but also I rather wait we stop writing to the old tables in production since that's going to give users a bit more time." [puppet] - https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191) (owner: Zabe)
[11:50:14] <wikibugs>	 (CR) Dpogorzelski: [C:+2] ml-serve: update eqiad kserve/knative [deployment-charts] - https://gerrit.wikimedia.org/r/1295850 (owner: Dpogorzelski)
[11:52:19] <wikibugs>	 (CR) Ladsgroup: [C:+1] profile::mariadb::proxy: Use Puppet types [puppet] - https://gerrit.wikimedia.org/r/1294258 (owner: Muehlenhoff)
[11:55:58] <wikibugs>	 (CR) Ladsgroup: "We could also enable it for ten minutes and then revert it. It's not as great as enabling it gradually but it could work for most purposes" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[11:59:30] <logmsgbot>	 !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/ml-serve-eqiad: maintenance
[11:59:30] <logmsgbot>	 !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/ml-serve-eqiad: maintenance
[12:02:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:03:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:04:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[12:04:39] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2027.codfw.wmnet
[12:04:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[12:05:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:05:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[12:06:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:07:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:07:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2005.codfw.wmnet to drbd
[12:09:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:11:28] <logmsgbot>	 !log dpogorzelski@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=inference,name=eqiad
[12:13:57] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295880
[12:15:15] <logmsgbot>	 !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in eqiad/ml-serve-eqiad: maintenance
[12:15:50] <logmsgbot>	 !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in eqiad/ml-serve-eqiad: maintenance
[12:16:41] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295880 (owner: Bartosz Wójtowicz)
[12:17:38] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+2] ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295880 (owner: Bartosz Wójtowicz)
[12:17:54] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:18:51] <wikibugs>	 (CR) Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[12:18:57] <wikibugs>	 (PS2) Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051)
[12:19:57] <wikibugs>	 (Merged) jenkins-bot: ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295880 (owner: Bartosz Wójtowicz)
[12:20:12] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:21:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476) (owner: Mszwarc)
[12:22:13] <wikibugs>	 (CR) JMeybohm: [C:-1] ratelimit: Add CACHE_KEY_PREFIX configuration (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[12:22:28] <wikibugs>	 (PS3) Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051)
[12:22:40] <wikibugs>	 (CR) Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[12:23:29] <wikibugs>	 (PS4) Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051)
[12:23:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2005.codfw.wmnet to drbd
[12:24:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:25:12] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:25:26] <wikibugs>	 (PS3) Btullis: Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653)
[12:26:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[12:26:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[12:26:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2005.codfw.wmnet to plain
[12:27:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2005.codfw.wmnet to plain
[12:27:51] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:28:18] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] node20-slim: Fix image build [docker-images/production-images] - https://gerrit.wikimedia.org/r/1295847 (owner: Muehlenhoff)
[12:28:31] <wikibugs>	 (PS5) Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051)
[12:28:36] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:28:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[12:29:20] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:30:12] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:32:02] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1295888
[12:35:07] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:35:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:39:27] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93435 and previous config saved to /var/cache/conftool/dbconfig/20260601-123926-fceratto.json
[12:41:49] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:42:10] <wikibugs>	 (CR) Clément Goubert: ratelimit-media: policy and user-class level metrics (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[12:42:13] <wikibugs>	 (PS2) Clément Goubert: ratelimit-media: policy and user-class level metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051)
[12:42:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:43:25] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:44:02] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:44:25] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Update outlink-topic-model docker image. [deployment-charts] - https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493)
[12:46:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T415109#11971857 (VRiley-WMF) @Papaul out of curiosity, should we still be keeping this ticket open? Or is it safe to close out now?
[12:46:54] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:47:07] <wikibugs>	 (PS1) Atsuko: service: services_proxy: prod opensearch-on-k8s services [puppet] - https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248)
[12:47:31] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:47:36] <wikibugs>	 (CR) Majavah: [C:+1] toolforge::elasticsearch::haproxy: Restrict to cloud network [puppet] - https://gerrit.wikimedia.org/r/1295398 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[12:47:39] <wikibugs>	 (PS4) Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs [puppet] - https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992)
[12:47:39] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache::haproxy: remove the lua_contact_info feature flag [puppet] - https://gerrit.wikimedia.org/r/1295902 (https://phabricator.wikimedia.org/T414300)
[12:48:39] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:49:28] <wikibugs>	 (CR) Majavah: [C:-1] "This profile is used to configure the Cloud VPS outbound email relays (`mx-out*.cloudinfra.eqiad1.wikimedia.cloud`) which need to accept o" [puppet] - https://gerrit.wikimedia.org/r/1284671 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[12:49:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93436 and previous config saved to /var/cache/conftool/dbconfig/20260601-124934-fceratto.json
[12:49:35] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:50:54] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:52:24] <wikibugs>	 (CR) Bartosz Dziewoński: "Hmm, on a closer look, it worked *some* of the time. It doesn't work today, and it didn't work when it was added in October 2024 (change 1" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295502 (owner: Bartosz Dziewoński)
[12:52:25] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:53:32] <wikibugs>	 (CR) Majavah: [C:+1] Revert "labswiki: Disallow account autocreation" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295502 (owner: Bartosz Dziewoński)
[12:55:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] cache::haproxy: support wikilink style usernames in UAs [puppet] - https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) (owner: Giuseppe Lavagetto)
[12:55:27] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[12:55:30] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:55:34] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[12:55:38] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[12:55:41] <wikibugs>	 (PS3) Bartosz Dziewoński: Revert "labswiki: Disallow account autocreation" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295502
[12:55:41] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[12:55:41] <wikibugs>	 (CR) AikoChou: [C:+1] ml-services: Update outlink-topic-model docker image. [deployment-charts] - https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[12:55:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[12:55:46] <wikibugs>	 (CR) Atsuko: [C:+2] service: move eventstreams-internal to service_setup [puppet] - https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko)
[12:55:47] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:55:52] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[12:55:55] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[12:55:58] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:56:01] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
[12:56:04] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
[12:56:06] <wikibugs>	 (CR) Bartosz Dziewoński: "I think this commit message explains the situation better, thanks for prompting me to investigate." [mediawiki-config] - https://gerrit.wikimedia.org/r/1295502 (owner: Bartosz Dziewoński)
[12:56:07] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:56:10] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[12:56:13] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:56:16] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:56:19] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[12:56:19] <logmsgbot>	 !log dpogorzelski@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=inference,name=eqiad
[12:57:32] <wikibugs>	 (CR) Atsuko: [C:+2] "merging with fabfur" [puppet] - https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko)
[12:58:51] <wikibugs>	 (PS1) Majavah: P:wmcs::kubeadm::etcd: Migrate to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1295905 (https://phabricator.wikimedia.org/T427799)
[12:58:58] <wikibugs>	 (PS2) Majavah: P:wmcs::kubeadm::etcd: Migrate to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1295905 (https://phabricator.wikimedia.org/T427799)
[12:59:40] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: Btullis)
[12:59:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93437 and previous config saved to /var/cache/conftool/dbconfig/20260601-125941-fceratto.json
[13:00:05] <jouncebot>	 Lucas_WMDE, urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1300)
[13:00:05] <jouncebot>	 codenamenoreste and Msz2001: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Msz2001>	 o/
[13:00:22] <Lucas_WMDE>	 I can’t deploy, in a meeting
[13:00:30] <Msz2001>	 I can deploy the patches. codenamenoreste: shall I start with yours?
[13:00:57] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8621/co" [puppet] - https://gerrit.wikimedia.org/r/1295905 (https://phabricator.wikimedia.org/T427799) (owner: Majavah)
[13:01:00] <atsukoito>	 _joe_: there's an outstanding puppet change, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1276396 can I puppetmerge?
[13:01:01] <codenamenoreste>	 yes, it's a patch to allow the visual editor in the project namespace for Swahili Wikipedia
[13:01:14] <_joe_>	 atsukoito: yeah I was about to merge
[13:01:23] <_joe_>	 was waiting for the puppet disable round to finish
[13:01:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295536 (https://phabricator.wikimedia.org/T427117) (owner: Codename Noreste)
[13:01:33] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:01:36] <_joe_>	 I'll merge both our changes
[13:01:40] <codenamenoreste>	 but there's something wrong with my home’s wifi, so that's why I can't use my laptop nor WikimediaDebug
[13:01:45] <atsukoito>	 _joe_: thanks
[13:01:55] <_joe_>	 atsukoito: merging, will let you know once it's done
[13:02:14] <Msz2001>	 codenamenoreste: Okay, I can verify whether the patch works after it gets to the test server
[13:02:22] <wikibugs>	 (Merged) jenkins-bot: swwiki: Enable the Visual Editor on the project namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1295536 (https://phabricator.wikimedia.org/T427117) (owner: Codename Noreste)
[13:02:36] <codenamenoreste>	 wait, my wifi works, I'll use WikimediaDebug
[13:02:40] <logmsgbot>	 !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1295536|swwiki: Enable the Visual Editor on the project namespace (T427117)]]
[13:02:43] <_joe_>	 atsukoito: done
[13:02:44] <stashbot>	 T427117: Enable VisualEditor in the Project namespace for Swahili Wikipedia (swwiki) - https://phabricator.wikimedia.org/T427117
[13:02:48] <atsukoito>	 _joe_: thanks
[13:03:15] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:03:18] <wikibugs>	 (PS1) Majavah: P:toolforge::redis_sentinel: Migrate to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1295908
[13:03:55] <Msz2001>	 ack
[13:04:02] <codenamenoreste>	 wait what?
[13:04:11] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:04:21] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:04:26] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8622/co" [puppet] - https://gerrit.wikimedia.org/r/1295908 (owner: Majavah)
[13:04:31] <logmsgbot>	 !log mszwarc@deploy1003 codenamenoreste, mszwarc: Backport for [[gerrit:1295536|swwiki: Enable the Visual Editor on the project namespace (T427117)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:04:50] <Msz2001>	 That was so quick :o
[13:05:06] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829)
[13:05:08] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829)
[13:05:19] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:05:21] <Msz2001>	 codenamenoreste: Are you able to verify the patch or should I?
[13:05:40] <codenamenoreste>	 WikimediaDebug is turned on on my laptop
[13:06:42] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[13:07:05] <Msz2001>	 So you can verify the patch now, then :)
[13:07:06] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:07:13] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:08:00] <wikibugs>	 (PS1) Majavah: P:toolforge:legacy_redirector: Migrate to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804)
[13:08:13] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:08:25] <wikibugs>	 (PS2) Majavah: P:toolforge::redis_sentinel: Migrate to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1295908 (https://phabricator.wikimedia.org/T427799)
[13:08:34] <logmsgbot>	 !log mszwarc@deploy1003 codenamenoreste, mszwarc: Continuing with deployment
[13:08:40] <Msz2001>	 Verified myself
[13:08:48] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8623/console" [puppet] - https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804) (owner: Majavah)
[13:09:07] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal
[13:09:07] <codenamenoreste>	 I also checked and can verify that it works
[13:09:07] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal
[13:09:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:09:25] <wikibugs>	 (CR) Mszwarc: [C:+2] "Ahead of deployment" [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476) (owner: Mszwarc)
[13:09:29] <wikibugs>	 (CR) Btullis: [C:+2] Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: Btullis)
[13:09:40] <wikibugs>	 (PS2) Kosta Harlan: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829)
[13:09:46] <wikibugs>	 (PS2) Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829)
[13:09:50] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93438 and previous config saved to /var/cache/conftool/dbconfig/20260601-130949-fceratto.json
[13:10:09] <wikibugs>	 (PS1) Majavah: P:elasticsearch: Migrate inter-node traffic to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799)
[13:10:13] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:10:13] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:10:31] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8624/co" [puppet] - https://gerrit.wikimedia.org/r/1295902 (https://phabricator.wikimedia.org/T414300) (owner: Giuseppe Lavagetto)
[13:10:58] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8625/co" [puppet] - https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799) (owner: Majavah)
[13:10:59] <wikibugs>	 (Merged) jenkins-bot: Add SetGlobalPreference maintenance script [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476) (owner: Mszwarc)
[13:11:25] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal
[13:11:25] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal
[13:12:02] <codenamenoreste>	 should sfs-block-bypass be removed from the IP block exemption user group? the StopForumSpam extension was removed
[13:12:46] <logmsgbot>	 !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295536|swwiki: Enable the Visual Editor on the project namespace (T427117)]] (duration: 10m 06s)
[13:12:50] <stashbot>	 T427117: Enable VisualEditor in the Project namespace for Swahili Wikipedia (swwiki) - https://phabricator.wikimedia.org/T427117
[13:12:58] <wikibugs>	 (CR) Majavah: [V:+1] "Seemingly we're the only users of profile::elasticsearch, with everyone else having moved to profile::opensearch::server." [puppet] - https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799) (owner: Majavah)
[13:12:59] <Msz2001>	 Seems like it can be removed. I think I have seen a task for it somewhere
[13:14:11] <logmsgbot>	 !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1295875|Add SetGlobalPreference maintenance script (T427476)]]
[13:14:15] <stashbot>	 T427476: Add a maintenance script to set global preferences for listed users - https://phabricator.wikimedia.org/T427476
[13:14:16] <atsukoito>	 !log sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service'
[13:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:26] <wikibugs>	 (03CR) 10Kosta Harlan: [C:04-2] "Needs to wait for the /static directory to be populated on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan)
[13:15:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:15:55] <logmsgbot>	 !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1295875|Add SetGlobalPreference maintenance script (T427476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:16:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:16:25] <logmsgbot>	 !log mszwarc@deploy1003 mszwarc: Continuing with deployment
[13:18:24] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs-test1001.eqiad.wmnet
[13:18:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1265.eqiad.wmnet with OS trixie
[13:19:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs-test2001.codfw.wmnet
[13:20:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:20:34] <logmsgbot>	 !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295875|Add SetGlobalPreference maintenance script (T427476)]] (duration: 06m 22s)
[13:20:38] <stashbot>	 T427476: Add a maintenance script to set global preferences for listed users - https://phabricator.wikimedia.org/T427476
[13:20:54] <Msz2001>	 !log UTC afternoon backport+config window done
[13:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:57] <atsukoito>	 !log restarted pybal.service on lvs1020
[13:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:21:58] <logmsgbot>	 !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/ml-serve-eqiad: maintenance
[13:22:22] <logmsgbot>	 !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/ml-serve-eqiad: maintenance
[13:22:25] <atsukoito>	 !log restarted pybal.service on lvs1019
[13:22:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:29] <wikibugs>	 (03PS2) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087)
[13:23:31] <wikibugs>	 (03PS2) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087)
[13:23:33] <wikibugs>	 (03PS1) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645)
[13:24:17] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs-test1001.eqiad.wmnet
[13:24:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[13:24:47] <wikibugs>	 (03PS1) 10Codename Noreste: Remove sfsblock-bypass from the IP block exemption user group on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745)
[13:24:51] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs-test2001.codfw.wmnet
[13:25:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[13:26:05] <wikibugs>	 (03PS1) 10Ayounsi: Add InterfaceNoDescription alert [alerts] - 10https://gerrit.wikimedia.org/r/1295919 (https://phabricator.wikimedia.org/T419298)
[13:26:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11972052 (10Jclark-ctr) Errors are on sdb and has failed in md1 array    matching serials according to idrac it is in slot 4  `   [Mon Jun  1 12:30:36 2026] I/O error, dev sdb, sector 3750748677 op 0x0:...
[13:26:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis)
[13:27:24] <codenamenoreste>	 I'm on my laptop, and I have another patch to review and deploy: 1295918
[13:27:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:30:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:30:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:30:22] <wikibugs>	 (03CR) 10Jelto: "_IF_ we change the SSH config I'd prefer using a dedicated hostname and port 22 instead of changing the port to 2222 and using the TCP pro" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn)
[13:31:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:31:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:31:12] <wikibugs>	 (03PS1) 10Slyngshede: P:cache:haproxy add image generator information [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338)
[13:31:16] <atsukoito>	 !log restarted pybal.service on lvs2014
[13:31:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] cache::haproxy: remove the lua_contact_info feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1295902 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto)
[13:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:46] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1265.eqiad.wmnet with reason: host reimage
[13:32:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11972086 (10FCeratto-WMF) a:05VRiley-WMF→03FCeratto-WMF Thanks @VRiley-WMF  journald is not showing hardware errors. MariaDB started cleanly, replication is catching up as expected. https://grafana.wi...
[13:35:28] <wikibugs>	 (03PS2) 10Slyngshede: P:cache:haproxy add image generator information [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338)
[13:35:41] <atsukoito>	 !log restarted pybal.service on lvs2013
[13:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:48] <wikibugs>	 (03PS3) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087)
[13:35:48] <wikibugs>	 (03PS3) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087)
[13:35:48] <wikibugs>	 (03PS2) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645)
[13:36:13] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy)
[13:36:15] <wikibugs>	 (03PS1) 10Btullis: Declare the webrequest.dumps.dev0 stream in EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645)
[13:36:20] <wikibugs>	 (03PS4) 10Ahmon Dancy: scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829)
[13:37:49] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1265.eqiad.wmnet with reason: host reimage
[13:38:57] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106
[13:38:57] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "I reviewed https://codesearch.wmcloud.org/deployed/?q=writeapi and https://global-search.toolforge.org/?q=writeapi&namespaces=2%2C4%2C8&ti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński)
[13:39:02] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106
[13:39:40] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[13:39:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93439 and previous config saved to /var/cache/conftool/dbconfig/20260601-133947-fceratto.json
[13:39:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński)
[13:40:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński)
[13:40:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745) (owner: 10Codename Noreste)
[13:41:12] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy)
[13:41:14] <codenamenoreste>	 one patch should be deployed right now
[13:42:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:32] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] ratelimit-media: policy and user-class level metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert)
[13:43:38] <wikibugs>	 (03CR) 10Ssingh: "Thanks for confirming @jwodstrcil@wikimedia.org that this doesn't break the Gitlab workflow/experience!" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn)
[13:48:45] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] "I kicked off a manual rebuild of the image and it now worked fine:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295847 (owner: 10Muehlenhoff)
[13:50:12] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:51:09] <codenamenoreste>	 I have T427745 to resolve, and I'm waiting
[13:51:10] <stashbot>	 T427745: Remove sfsblock-bypass from ipblock-exempt group - https://phabricator.wikimedia.org/T427745
[13:51:38] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Reenable read-only ES backups [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661)
[13:52:11] <wikibugs>	 (03CR) 10Jforrester: "Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295847 (owner: 10Muehlenhoff)
[13:52:12] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:52:32] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1265.eqiad.wmnet with OS trixie
[13:52:50] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[13:53:12] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:54:39] <codenamenoreste>	 is there any deployer available?
[13:54:48] <wikibugs>	 (03CR) 10Jcrespo: [C:04-2] "Do not merge until tonight's rw run to avoid conflicts." [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo)
[13:55:12] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:55:15] <Lucas_WMDE>	 jouncebot: nowandnext
[13:55:16] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1300)
[13:55:16] <jouncebot>	 In 0 hour(s) and 34 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1430)
[13:55:21] <Lucas_WMDE>	 I guess I can deploy it now
[13:56:58] <codenamenoreste>	 ^ https://phabricator.wikimedia.org/T427745
[13:57:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745) (owner: 10Codename Noreste)
[13:58:03] <wikibugs>	 (03CR) 10Zabe: "I could also do something horrible like canceling the sync after it reached the canaries and let it sit there for a few minutes and see wha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe)
[13:58:36] <wikibugs>	 (03Merged) 10jenkins-bot: Remove sfsblock-bypass from the IP block exemption user group on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745) (owner: 10Codename Noreste)
[13:58:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1295918|Remove sfsblock-bypass from the IP block exemption user group on all wikis (T427745)]]
[13:58:56] <stashbot>	 T427745: Remove sfsblock-bypass from ipblock-exempt group - https://phabricator.wikimedia.org/T427745
[14:00:36] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] "applied" [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko)
[14:01:12] <wikibugs>	 (03CR) 10Ladsgroup: "it wouldn't even make it to the list of top ten horrible things we have done!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe)
[14:01:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:01:56] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:sessionstore
[14:02:07] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[14:02:34] <wikibugs>	 (03PS4) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087)
[14:02:38] <wikibugs>	 (03PS3) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645)
[14:03:23] <codenamenoreste>	 Lucas_WMDE when I used WikimediaDebug the sfsblock-bypass right is no longer there (from the test servers)
[14:03:31] <codenamenoreste>	 Should be okay to deploy
[14:03:35] <Lucas_WMDE>	 …
[14:03:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, codenamenoreste: Backport for [[gerrit:1295918|Remove sfsblock-bypass from the IP block exemption user group on all wikis (T427745)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:03:45] <Lucas_WMDE>	 codenamenoreste: please test *now*
[14:04:11] <wikibugs>	 (03CR) 10Btullis: Configure nginx to log requests in ECS format to syslog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[14:04:15] <Lucas_WMDE>	 testing early just makes everything more confusing because I don’t know which server with which version you hit
[14:05:07] <jinxer-wm>	 FIRING: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:05:11] <codenamenoreste>	 it works!
[14:05:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, codenamenoreste: Continuing with deployment
[14:05:50] <Lucas_WMDE>	 alright, thanks!
[14:06:10] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[14:06:12] <wikibugs>	 (03PS2) 10Ayounsi: Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042
[14:08:29] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis)
[14:09:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295918|Remove sfsblock-bypass from the IP block exemption user group on all wikis (T427745)]] (duration: 11m 06s)
[14:10:00] <wikibugs>	 (03CR) 10Ssingh: Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi)
[14:10:03] <stashbot>	 T427745: Remove sfsblock-bypass from ipblock-exempt group - https://phabricator.wikimedia.org/T427745
[14:10:07] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:10:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:11:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:11:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:11:34] <Amir1>	 jouncebot: nowandnext
[14:11:34] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 18 minute(s)
[14:11:35] <jouncebot>	 In 0 hour(s) and 18 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1430)
[14:11:45] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Declare the webrequest.dumps.dev0 stream in EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis)
[14:11:54] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:11:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:59] <Lucas_WMDE>	 (a few minutes ago)
[14:12:11] <Amir1>	 cool cool. I was about to ping you
[14:12:46] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[14:12:52] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[14:13:11] <wikibugs>	 (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295930 (https://phabricator.wikimedia.org/T421797)
[14:15:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:16:05] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[14:17:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:17:51] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295930 (https://phabricator.wikimedia.org/T421797) (owner: 10Ladsgroup)
[14:17:58] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[14:19:19] <wikibugs>	 (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295930 (https://phabricator.wikimedia.org/T421797) (owner: 10Ladsgroup)
[14:19:43] <wikibugs>	 (03PS1) 10Jgiannelos: tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295932
[14:20:07] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:20:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:20:16] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Unrack old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427283#11972441 (10ayounsi) p:05Triage→03Low
[14:20:41] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11972454 (10ayounsi) p:05Triage→03Medium
[14:21:49] <wikibugs>	 (03PS5) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087)
[14:21:49] <wikibugs>	 (03PS4) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087)
[14:21:50] <wikibugs>	 (03PS4) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645)
[14:21:52] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[14:22:05] <logmsgbot>	 !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[14:22:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:23:12] <wikibugs>	 (03PS19) 10Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300)
[14:23:22] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi)
[14:23:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11972468 (10MoritzMuehlenhoff) p:05Triage→03Medium
[14:23:34] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:25:07] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:25:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:25:15] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:25:23] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[14:26:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:27:16] <logmsgbot>	 !log ladsgroup@deploy1003 Synchronized portals/wikipedia.org/assets: Deploy portals (T421797) (duration: 06m 10s)
[14:27:19] <stashbot>	 T421797: Remove Wikinews from various multilingual portals - https://phabricator.wikimedia.org/T421797
[14:27:38] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] Cleanup old values for turnilo and eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295405 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko)
[14:28:01] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[14:29:48] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis)
[14:29:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07Puppet (Puppet 7.0): slow command processing when introducing lots of new hosts - https://phabricator.wikimedia.org/T341674#11972506 (10LSobanski) 05Open→03Declined This was a one-off problem and we've fully migrated to Puppet 7 now. Resolving.
[14:30:00] <logmsgbot>	 !log ladsgroup@deploy1003 Synchronized portals: Deploy portals (T421797) (duration: 02m 43s)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1430)
[14:30:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:30:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:30:22] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:29] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[14:31:07] <wikibugs>	 (03PS11) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874)
[14:31:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:31:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:33:25] <wikibugs>	 (CR) Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: Federico Ceratto)
[14:33:34] <wikibugs>	 (CR) JMeybohm: [C:+1] docker_registry:  replace rdb2009 with rdb2013 [puppet] - https://gerrit.wikimedia.org/r/1294279 (https://phabricator.wikimedia.org/T418924) (owner: Effie Mouzeli)
[14:34:02] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): remove puppet::expose_agent_certs from puppetdb classes - https://phabricator.wikimedia.org/T341374#11972566 (LSobanski) p:Medium→Low
[14:34:16] <wikibugs>	 (CR) CI reject: [V:-1] sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: Federico Ceratto)
[14:34:17] <wikibugs>	 (CR) Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: Federico Ceratto)
[14:34:30] <wikibugs>	 (CR) JMeybohm: [C:+1] ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: Clément Goubert)
[14:35:22] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:35:55] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations: Decom cookbook should only warn about unexpected matches in Puppet - https://phabricator.wikimedia.org/T297516#11972594 (LSobanski) This looks resolved. @RLazarus please reopen if you think otherwise.
[14:36:10] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[14:36:22] <wikibugs>	 (Merged) jenkins-bot: Cleanup old values for turnilo and eventstreams [deployment-charts] - https://gerrit.wikimedia.org/r/1295405 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko)
[14:37:29] <wikibugs>	 (PS6) Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087)
[14:37:29] <wikibugs>	 (PS5) Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087)
[14:37:29] <wikibugs>	 (PS5) Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645)
[14:37:45] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[14:37:51] <wikibugs>	 SRE, Observability-Alerting, Puppet-Core, Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253#11972611 (LSobanski) Untagging IF.
[14:37:51] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[14:37:58] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[14:38:28] <wikibugs>	 (PS12) Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874)
[14:38:45] <wikibugs>	 SRE, netops, Traffic-Icebox: experiment with reenabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288#11972619 (LSobanski) Untagging IF.
[14:39:44] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990#11972623 (LSobanski) p:Medium→Low
[14:40:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93440 and previous config saved to /var/cache/conftool/dbconfig/20260601-144002-fceratto.json
[14:41:08] <wikibugs>	 SRE, Infrastructure-Foundations, provisioning-automation: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875#11972624 (LSobanski)
[14:41:24] <wikibugs>	 SRE, Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned: [puppet] Remove expired and unused certs from  modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295#11972625 (fnegri) p:High→Low a:dcaro→None > So this task is to remove any...
[14:41:59] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:42:10] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:42:14] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: Btullis)
[14:42:18] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:43:19] <wikibugs>	 (PS3) CDanis: cache::haproxy: limit email addresses to reasonable lengths [puppet] - https://gerrit.wikimedia.org/r/1240174 (owner: Giuseppe Lavagetto)
[14:43:21] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1240174 (owner: Giuseppe Lavagetto)
[14:44:52] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11972676 (VRiley-WMF) In progress→Resolved Thank you for the update! Closing this. Please let us know if anything else happens!
[14:45:19] <wikibugs>	 (CR) CDanis: [C:+2] cache::haproxy: limit email addresses to reasonable lengths [puppet] - https://gerrit.wikimedia.org/r/1240174 (owner: Giuseppe Lavagetto)
[14:46:20] <wikibugs>	 (CR) Btullis: "I think that this is done." [puppet] - https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[14:47:10] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:48:01] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[14:49:02] <wikibugs>	 SRE-Access-Requests: Adding FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T427823 (jasmine_) NEW
[14:49:22] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:49:29] <wikibugs>	 (CR) JMeybohm: "This looks pretty good, thanks! Two minor nits inline" [deployment-charts] - https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: Daniel Kinzler)
[14:49:52] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:50:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93441 and previous config saved to /var/cache/conftool/dbconfig/20260601-145010-fceratto.json
[14:50:15] <wikibugs>	 (PS7) Andrew Bogott: designate: remove leftover mcrouter code [puppet] - https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189)
[14:50:38] <logmsgbot>	 !log atsuko@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:50:54] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) (owner: Andrew Bogott)
[14:51:05] <logmsgbot>	 !log atsuko@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:51:54] <wikibugs>	 (PS1) Federico Ceratto: db1224: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1295940 (https://phabricator.wikimedia.org/T427535)
[14:51:59] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[14:52:05] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum5003.eqsin.wmnet with OS trixie
[14:52:20] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1209: Upgrading db1209.eqiad.wmnet
[14:52:36] <logmsgbot>	 !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:52:51] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1209: Upgrading db1209.eqiad.wmnet
[14:52:52] <logmsgbot>	 !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:53:07] <jinxer-wm>	 FIRING: ProbeDown: Service sessionstore1005-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1005-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:54:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804) (owner: Majavah)
[14:54:18] <wikibugs>	 (PS5) Atsuko: Cleanup eventstream-internal [puppet] - https://gerrit.wikimedia.org/r/1295406 (https://phabricator.wikimedia.org/T348763)
[14:54:36] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:toolforge:legacy_redirector: Migrate to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804) (owner: Majavah)
[14:54:57] <wikibugs>	 (CR) Muehlenhoff: [C:+2] profile::mariadb::proxy: Use Puppet types [puppet] - https://gerrit.wikimedia.org/r/1294258 (owner: Muehlenhoff)
[14:55:06] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1209.eqiad.wmnet with OS trixie
[14:55:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:56:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:56:25] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:56:31] <wikibugs>	 (CR) Muehlenhoff: [C:+2] toolforge::elasticsearch::haproxy: Restrict to cloud network [puppet] - https://gerrit.wikimedia.org/r/1295398 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff)
[14:57:19] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11972753 (MoritzMuehlenhoff)
[14:57:33] <wikibugs>	 (PS1) Jasmine: admin: replacing spare FIDO backed key [puppet] - https://gerrit.wikimedia.org/r/1295941 (https://phabricator.wikimedia.org/T427823)
[14:58:07] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:59:18] <wikibugs>	 (CR) Brouberol: [C:+1] Cleanup eventstream-internal [puppet] - https://gerrit.wikimedia.org/r/1295406 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko)
[14:59:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11972766 (Dzahn)
[15:00:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11972767 (Dzahn) confirmed SSH key out of band
[15:00:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93443 and previous config saved to /var/cache/conftool/dbconfig/20260601-150017-fceratto.json
[15:03:12] <wikibugs>	 (CR) Brouberol: [C:-1] "The ports collide with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295396 I think the ports assigned to the previous opensearch " [puppet] - https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: Atsuko)
[15:04:37] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service sessionstore1005-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:05:37] <wikibugs>	 (PS2) Jasmine: admin: replacing spare FIDO backed key [puppet] - https://gerrit.wikimedia.org/r/1295941 (https://phabricator.wikimedia.org/T427823)
[15:06:25] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:08:27] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:sessionstore
[15:08:36] <wikibugs>	 (PS13) Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007)
[15:09:37] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:10:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93445 and previous config saved to /var/cache/conftool/dbconfig/20260601-151024-fceratto.json
[15:10:35] <wikibugs>	 (PS1) Muehlenhoff: autoinstall: Switch to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/1295945 (https://phabricator.wikimedia.org/T416707)
[15:10:46] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage
[15:10:51] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:10:52] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 19 minute(s)
[15:10:52] <jouncebot>	 In 0 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1530)
[15:10:53] <wikibugs>	 (PS1) Dreamy Jazz: hCaptcha: Enable for VisualEditor on all WMF wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295946 (https://phabricator.wikimedia.org/T425940)
[15:11:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295946 (https://phabricator.wikimedia.org/T425940) (owner: Dreamy Jazz)
[15:12:09] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Enable for VisualEditor on all WMF wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295946 (https://phabricator.wikimedia.org/T425940) (owner: Dreamy Jazz)
[15:12:16] <wikibugs>	 (PS1) Kamila Součková: CI: Fix CI pass on template render fail [deployment-charts] - https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307)
[15:12:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] autoinstall: Switch to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/1295945 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[15:12:24] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1295946|hCaptcha: Enable for VisualEditor on all WMF wikis (T425940)]]
[15:12:27] <stashbot>	 T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940
[15:12:46] <wikibugs>	 (CR) CWilliams: [C:+1] db1224: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1295940 (https://phabricator.wikimedia.org/T427535) (owner: Federico Ceratto)
[15:12:52] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:13:09] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:14:04] <wikibugs>	 (CR) Federico Ceratto: [C:+2] db1224: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1295940 (https://phabricator.wikimedia.org/T427535) (owner: Federico Ceratto)
[15:14:09] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1295946|hCaptcha: Enable for VisualEditor on all WMF wikis (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:14:31] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage
[15:14:43] <wikibugs>	 (PS1) Kamila Součková: .fixtures: remove erroneously committed file [deployment-charts] - https://gerrit.wikimedia.org/r/1295949
[15:16:37] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment
[15:17:57] <kostajh>	 I have a config patch to sync when Dreamy_Jazz is done
[15:18:05] <wikibugs>	 (CR) Atsuko: [C:+2] Cleanup eventstream-internal [puppet] - https://gerrit.wikimedia.org/r/1295406 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko)
[15:18:21] <wikibugs>	 (PS1) Ottomata: mediawiki.user_change.dev0 - key by user.wiki_id [mediawiki-config] - https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198)
[15:19:28] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:19:28] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1295941 (https://phabricator.wikimedia.org/T427823) (owner: Jasmine)
[15:19:33] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:19:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] admin: replacing spare FIDO backed key [puppet] - https://gerrit.wikimedia.org/r/1295941 (https://phabricator.wikimedia.org/T427823) (owner: Jasmine)
[15:19:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:48] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295946|hCaptcha: Enable for VisualEditor on all WMF wikis (T425940)]] (duration: 08m 24s)
[15:20:52] <stashbot>	 T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940
[15:20:57] <Dreamy_Jazz>	 kostajh: Your turn
[15:21:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295802 (owner: Kosta Harlan)
[15:21:17] <kostajh>	 Dreamy_Jazz: thx
[15:22:05] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:22:09] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:22:14] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Raise SiteVerify error threshold to 100 [mediawiki-config] - https://gerrit.wikimedia.org/r/1295802 (owner: Kosta Harlan)
[15:22:18] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:22:23] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply
[15:22:29] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1295802|hCaptcha: Raise SiteVerify error threshold to 100]]
[15:22:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2005.codfw.wmnet with OS bullseye
[15:24:15] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1295802|hCaptcha: Raise SiteVerify error threshold to 100]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:24:35] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with deployment
[15:24:58] <wikibugs>	 (PS3) Kosta Harlan: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829)
[15:24:58] <wikibugs>	 (PS3) Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829)
[15:25:32] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling
[15:25:43] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1224: Pooling
[15:25:58] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1224.eqiad.wmnet
[15:25:58] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1224.eqiad.wmnet
[15:26:07] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling
[15:26:31] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[15:26:39] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93446 and previous config saved to /var/cache/conftool/dbconfig/20260601-152638-fceratto.json
[15:27:26] <wikibugs>	 (PS1) Dzahn: admin: upgrade Mahmoud Abdelsattar from ldap_only to shell user [puppet] - https://gerrit.wikimedia.org/r/1295952 (https://phabricator.wikimedia.org/T427597)
[15:28:45] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295802|hCaptcha: Raise SiteVerify error threshold to 100]] (duration: 06m 15s)
[15:29:42] <wikibugs>	 (PS2) Atsuko: service: services_proxy: prod opensearch-on-k8s services [puppet] - https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248)
[15:30:04] <jouncebot>	 jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1530).
[15:31:32] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1209.eqiad.wmnet with OS trixie
[15:33:20] <wikibugs>	 (CR) TChin: [C:+1] mediawiki.user_change.dev0 - key by user.wiki_id [mediawiki-config] - https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198) (owner: Ottomata)
[15:33:38] <wikibugs>	 (PS1) Eevans: linked-artifacts: deploy v1.4.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1295954 (https://phabricator.wikimedia.org/T427508)
[15:34:36] <wikibugs>	 (PS2) Atsuko: translate: adding separate read/write endpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377)
[15:36:17] <wikibugs>	 (CR) Eevans: [C:+2] linked-artifacts: deploy v1.4.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1295954 (https://phabricator.wikimedia.org/T427508) (owner: Eevans)
[15:37:46] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[15:37:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage
[15:38:44] <wikibugs>	 (Merged) jenkins-bot: linked-artifacts: deploy v1.4.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1295954 (https://phabricator.wikimedia.org/T427508) (owner: Eevans)
[15:38:47] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1224: Pooling
[15:38:51] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling
[15:39:09] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[15:39:12] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1209: Migration of db1209.eqiad.wmnet completed
[15:39:21] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[15:39:40] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[15:40:01] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1224.eqiad.wmnet
[15:40:01] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1224.eqiad.wmnet
[15:40:08] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1224.eqiad.wmnet
[15:40:09] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1224.eqiad.wmnet
[15:40:13] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1224: Pooling
[15:40:16] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling
[15:42:22] <wikibugs>	 (CR) JMeybohm: [C:+1] .fixtures: remove erroneously committed file [deployment-charts] - https://gerrit.wikimedia.org/r/1295949 (owner: Kamila Součková)
[15:42:50] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+2] ml-services: Update outlink-topic-model docker image. [deployment-charts] - https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[15:44:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage
[15:44:56] <wikibugs>	 (CR) JMeybohm: [C:-1] CI: Fix CI pass on template render fail (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) (owner: Kamila Součková)
[15:45:12] <wikibugs>	 (Merged) jenkins-bot: ml-services: Update outlink-topic-model docker image. [deployment-charts] - https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[15:45:22] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1224: Pooling
[15:45:28] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973017 (ops-monitoring-bot) Starting pool of db1224 by fceratto@cumin1003: Pooling
[15:45:31] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973018 (ops-monitoring-bot) Completed pooling of db1224 by fceratto@cumin1003: Pooling
[15:48:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[15:49:08] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling
[15:49:14] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1224: Pooling
[15:49:18] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973047 (ops-monitoring-bot) Starting pool of db1224 by fceratto@cumin1003: Pooling
[15:49:21] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973048 (ops-monitoring-bot) Completed pooling of db1224 by fceratto@cumin1003: Pooling
[15:50:35] <logmsgbot>	 sukhe@cumin1003 reimage (PID 3686757) is awaiting input
[15:51:08] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum5003.eqsin.wmnet with OS trixie
[15:53:05] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973050 (VRiley-WMF) Hey @MatthewVernon thanks for the response on the other ticket. I know tuesdays get a bit meeting heavy for mys...
[15:53:26] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1224: Pooling
[15:56:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2005.codfw.wmnet with OS bullseye
[15:56:46] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.depool (exit_code=97) depool db1224: Pooling
[15:56:55] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling
[15:57:05] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973082 (ops-monitoring-bot) Starting pool of db1224 by fceratto@cumin1003: Pooling
[16:00:34] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973097 (MatthewVernon) Hi @VRiley-WMF I have oddly-full afternoons on other days at the moment; I could do 14:30-16:30 UTC on Wedne...
[16:01:18] <wikibugs>	 (CR) Atsuko: "Updated the ports to 65xx, checked that there's no collisions." [puppet] - https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: Atsuko)
[16:01:35] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:01:36] <wikibugs>	 (PS1) Muehlenhoff: autoinstall: Switch to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/1295956 (https://phabricator.wikimedia.org/T416707)
[16:01:57] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973102 (VRiley-WMF) @MatthewVernon Wednesday the 3rd absolutely works for me! we can start then
[16:02:30] <moritzm>	 !log temporarily remove ganeti2027 from the codfw cluster T427357
[16:02:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:33] <stashbot>	 T427357: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357
[16:03:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet
[16:04:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet
[16:04:57] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti2027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[16:04:57] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti2027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[16:05:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd
[16:05:50] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:06:01] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:07:57] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[16:08:56] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:08:56] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[16:09:18] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1236: Update
[16:09:48] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1236: Update
[16:10:27] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1236.eqiad.wmnet with reason: Kernel update T426633
[16:10:41] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Bump llm ns memory quota to 256Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295958
[16:24:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:24:42] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1209: Migration of db1209.eqiad.wmnet completed
[16:24:43] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[16:25:41] <wikibugs>	 (CR) JHathaway: [C:+1] role::pki::multirootca: remove the Kafka kafka_11 profile [puppet] - https://gerrit.wikimedia.org/r/1295023 (owner: Elukey)
[16:25:47] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:26:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93455 and previous config saved to /var/cache/conftool/dbconfig/20260601-162653-fceratto.json
[16:27:23] <wikibugs>	 (PS1) Marco Fossati: Carousel: Defer to MobileFrontend lightbox on mobile [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679)
[16:29:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd
[16:29:27] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1236: Update
[16:29:39] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.78 ms
[16:29:53] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1236: Update
[16:30:17] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: Marco Fossati)
[16:30:44] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1236: Update
[16:30:54] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1236.eqiad.wmnet
[16:30:55] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1236.eqiad.wmnet
[16:31:06] <wikibugs>	 (CR) Marco Fossati: [C:+1] Carousel: Defer to MobileFrontend lightbox on mobile [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: Marco Fossati)
[16:31:13] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[16:31:35] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1236.eqiad.wmnet with reason: Kernel update T426633
[16:34:13] <wikibugs>	 (CR) JHathaway: sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) (owner: Muehlenhoff)
[16:34:19] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[16:34:21] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[16:34:27] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1236: Update
[16:34:32] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1236: Update
[16:35:03] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,cluster=wdqs,service=wdqs-main,name=wdqs1015.eqiad.wmnet
[16:35:30] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1236.eqiad.wmnet
[16:35:31] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1236.eqiad.wmnet
[16:35:55] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:36:40] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[16:37:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93458 and previous config saved to /var/cache/conftool/dbconfig/20260601-163701-fceratto.json
[16:37:12] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,cluster=wdqs-main,service=wdqs-main,name=wdqs1015.eqiad.wmnet
[16:41:59] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: Atsuko)
[16:42:24] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1224: Pooling
[16:42:29] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973208 (ops-monitoring-bot) Completed pooling of db1224 by fceratto@cumin1003: Pooling
[16:47:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93460 and previous config saved to /var/cache/conftool/dbconfig/20260601-164709-fceratto.json
[16:47:16] <wikibugs>	 (PS3) Dzahn: gerrit: use stunnel with rsync of lfs data [puppet] - https://gerrit.wikimedia.org/r/1295500 (https://phabricator.wikimedia.org/T412780)
[16:47:21] <wikibugs>	 SRE, SRE-Access-Requests: Adding FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T427823#11973222 (jasmine_) Open→Resolved a:jasmine_
[16:47:41] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973231 (MatthewVernon) Cool, I've blocked that out in my calendar :)
[16:50:50] <jinxer-wm>	 RESOLVED: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:16] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912#11973277 (BCornwall) Open→Resolved That's a fair point, and considering we're on nvme drives power loss is less of a concern as well since it's non-volatile....
[16:51:25] <wikibugs>	 (CR) Brouberol: [C:+1] "Nice and thank you!" [puppet] - https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: Atsuko)
[16:51:49] <wikibugs>	 (CR) Jasmine: [C:+2] kafka-main2006: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1288917 (https://phabricator.wikimedia.org/T427088) (owner: Jasmine)
[16:52:06] <wikibugs>	 (CR) Atsuko: [C:+2] service: services_proxy: prod opensearch-on-k8s services [puppet] - https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: Atsuko)
[16:52:54] <wikibugs>	 (CR) Daniel Kinzler: Rakefile: Run chart specific tests (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: Daniel Kinzler)
[16:53:32] <wikibugs>	 (PS1) Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967
[16:54:10] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (owner: Dzahn)
[16:56:15] <wikibugs>	 (PS1) Marco Fossati: MultimediaViewer: enable image carousel as a beta feature on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1295968 (https://phabricator.wikimedia.org/T426799)
[16:57:18] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93462 and previous config saved to /var/cache/conftool/dbconfig/20260601-165717-fceratto.json
[16:57:30] <logmsgbot>	 !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS trixie
[16:57:46] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295968 (https://phabricator.wikimedia.org/T426799) (owner: Marco Fossati)
[16:58:13] <Amir1>	 !log drop flaggedrevs tables on wikinews wikis (T423577)
[16:58:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:16] <stashbot>	 T423577: Undeploy FlaggedRevs from Wikinews and drop FlaggedRevs tables - https://phabricator.wikimedia.org/T423577
[16:59:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd2001.codfw.wmnet to drbd
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1700)
[17:00:04] <jouncebot>	 ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1700).
[17:01:35] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:03:37] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1180: Pooling
[17:03:41] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1180: Pooling
[17:03:58] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1180: Pooling
[17:04:01] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1180: Pooling
[17:04:08] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1180: Pooling
[17:04:10] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1180: Pooling
[17:09:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:10:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[17:10:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd2001.codfw.wmnet to drbd
[17:10:13] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:55] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.80 ms
[17:16:29] <wikibugs>	 (CR) Majavah: [C:-1] "This needs more context/a task attached (why are we building a system to track individual users), but also should not be tied to individua" [puppet] - https://gerrit.wikimedia.org/r/1294864 (owner: Komla Sapaty)
[17:20:21] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1236: Update
[17:20:39] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to <Superset> for <APDube-WMF> - https://phabricator.wikimedia.org/T427553#11973388 (Raine)
[17:22:45] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to <Superset> for <APDube-WMF> - https://phabricator.wikimedia.org/T427553#11973401 (Raine) @Milimetric @Ahoelzl @Ottomata can one of you please approve? Thanks!
[17:28:40] <wikibugs>	 (PS1) Chlod Alejandro: nlwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1295976 (https://phabricator.wikimedia.org/T424519)
[17:29:24] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to <Superset> for <APDube-WMF> - https://phabricator.wikimedia.org/T427553#11973446 (Raine) >>! In T427553#11973400, @Raine wrote: > @Milimetric @Ahoelzl @Ottomata can one of you please approve? Thanks!   Apologies, I hadn't realised this...
[17:31:23] <TheresNoTime>	 jouncebot: nowandnext
[17:31:23] <jouncebot>	 For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1700)
[17:31:24] <jouncebot>	 In 2 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2000)
[17:32:10] <TheresNoTime>	 chlod: o/
[17:32:14] <chlod>	 \o/
[17:32:25] <TheresNoTime>	 I am going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1295976, a logos change
[17:32:36] <wikibugs>	 (PS1) Audrey Penven: Update config for WikiProjects linking prototype [mediawiki-config] - https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804)
[17:33:04] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to <Superset> for <APDube-WMF> - https://phabricator.wikimedia.org/T427553#11973471 (Milimetric) approved
[17:33:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295976 (https://phabricator.wikimedia.org/T424519) (owner: Chlod Alejandro)
[17:33:45] <wikibugs>	 (PS1) Kamila Součková: admin: add apdube-wmf user [puppet] - https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553)
[17:34:16] <wikibugs>	 (Merged) jenkins-bot: nlwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1295976 (https://phabricator.wikimedia.org/T424519) (owner: Chlod Alejandro)
[17:34:31] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1295976|nlwiki: change to Wikipedia 25 logo (T424519)]]
[17:34:33] <wikibugs>	 (CR) CI reject: [V:-1] admin: add apdube-wmf user [puppet] - https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) (owner: Kamila Součková)
[17:34:35] <stashbot>	 T424519: Per community Rfc, for the month of June please change to the birthday logo for nl.wikipedia.org - https://phabricator.wikimedia.org/T424519
[17:36:25] <logmsgbot>	 !log samtar@deploy1003 chlod, samtar: Backport for [[gerrit:1295976|nlwiki: change to Wikipedia 25 logo (T424519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:36:34] <chlod>	 checking
[17:37:31] <chlod>	 looks good :)
[17:37:45] <logmsgbot>	 !log samtar@deploy1003 chlod, samtar: Continuing with deployment
[17:39:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:39:44] <jinxer-wm>	 Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ...
[17:39:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:42:01] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295976|nlwiki: change to Wikipedia 25 logo (T424519)]] (duration: 07m 29s)
[17:42:04] <stashbot>	 T424519: Per community Rfc, for the month of June please change to the birthday logo for nl.wikipedia.org - https://phabricator.wikimedia.org/T424519
[17:42:19] <TheresNoTime>	 lgtm on prod now
[17:42:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:42:57] <chlod>	 likewise, thank you TheresNoTime!
[17:43:05] <TheresNoTime>	 np!
[17:44:28] <wikibugs>	 SRE, Traffic: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836 (ssingh) NEW
[17:44:30] <wikibugs>	 SRE, Traffic: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#11973510 (ssingh) p:Triage→Medium
[17:53:03] <logmsgbot>	 !log jasmine@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS trixie
[17:58:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet
[17:59:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet
[18:01:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain
[18:01:32] <logmsgbot>	 !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[18:01:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain
[18:02:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd2001.codfw.wmnet to plain
[18:02:46] <logmsgbot>	 !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[18:03:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd2001.codfw.wmnet to plain
[18:03:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1117.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:04:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:04:19] <wikibugs>	 (CR) Ottomata: "Ahhh! Got it!  Great, so this produces to event platform streams, then logstash just consumes them. Okay!" [puppet] - https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[18:05:07] <wikibugs>	 (CR) Ottomata: [C:+1] Configure nginx to log requests in ECS format to syslog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[18:05:11] <wikibugs>	 (CR) Ottomata: [C:+1] Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[18:05:20] <wikibugs>	 (CR) Ottomata: [C:+1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: Btullis)
[18:05:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet
[18:06:07] <wikibugs>	 (CR) Ottomata: [C:+1] Declare the webrequest.dumps.dev0 stream in EventStreamConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645) (owner: Btullis)
[18:06:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet
[18:14:39] <wikibugs>	 (CR) Ottomata: flink-app - default to setting metrics.internal.query-service.port (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: Ottomata)
[18:14:42] <wikibugs>	 (PS3) Ottomata: flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216)
[18:14:45] <icinga-wm>	 PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[18:14:51] <wikibugs>	 (CR) Ottomata: [C:+2] flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: Ottomata)
[18:16:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198) (owner: Ottomata)
[18:16:39] <wikibugs>	 (PS1) Jdlrobson: styles: Hide donor badge container by default [skins/MinervaNeue] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295990 (https://phabricator.wikimedia.org/T425450)
[18:17:25] <wikibugs>	 (Merged) jenkins-bot: mediawiki.user_change.dev0 - key by user.wiki_id [mediawiki-config] - https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198) (owner: Ottomata)
[18:17:38] <logmsgbot>	 !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1295950|mediawiki.user_change.dev0 - key by user.wiki_id (T426198)]]
[18:17:41] <stashbot>	 T426198: Event schemas - mediawiki user entity should be wiki aware - https://phabricator.wikimedia.org/T426198
[18:17:49] <wikibugs>	 (Merged) jenkins-bot: flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: Ottomata)
[18:19:28] <logmsgbot>	 !log otto@deploy1003 otto: Backport for [[gerrit:1295950|mediawiki.user_change.dev0 - key by user.wiki_id (T426198)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:20:05] <logmsgbot>	 !log otto@deploy1003 otto: Continuing with deployment
[18:23:04] <wikibugs>	 (CR) Volans: [C:-1] "The idea of the patch is fine, it's a nice addition and I can see when it could be useful." [software/cumin] - https://gerrit.wikimedia.org/r/1294990 (owner: CDanis)
[18:24:20] <logmsgbot>	 !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295950|mediawiki.user_change.dev0 - key by user.wiki_id (T426198)]] (duration: 06m 42s)
[18:24:23] <stashbot>	 T426198: Event schemas - mediawiki user entity should be wiki aware - https://phabricator.wikimedia.org/T426198
[18:24:37] <wikibugs>	 (Abandoned) Dr0ptp4kt: Reactivate wikimedia.de email addresses for GrowthBook SSO [deployment-charts] - https://gerrit.wikimedia.org/r/1294372 (https://phabricator.wikimedia.org/T418665) (owner: Dr0ptp4kt)
[18:29:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[18:29:44] <jinxer-wm>	 Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ...
[18:29:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:38:10] <wikibugs>	 (PS1) JHathaway: mx: honor reject policy for DMARC [puppet] - https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884)
[18:38:25] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884) (owner: JHathaway)
[18:44:45] <icinga-wm>	 RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1002 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[18:45:21] <ottomata>	 EventStreams is flapping. Containers are OOMing, not sure why.
[18:45:21] <ottomata>	 https://phabricator.wikimedia.org/T427839
[18:49:09] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884) (owner: JHathaway)
[18:51:40] <wikibugs>	 (CR) JHathaway: [C:+2] mx: honor reject policy for DMARC [puppet] - https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884) (owner: JHathaway)
[18:53:40] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:57:52] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/4 (Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[19:00:36] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: sync
[19:01:05] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync
[19:01:56] <jasmine_>	 ottomata: working on kafka-main trixie upgrade (tested on a single host, kafka-main2006, and it failed; currently looking through logs). might it be related?
[19:02:15] <jasmine_>	 re: EventStreams^
[19:03:46] <jasmine_>	 ah nvm looks like it's been flapping from before the reimage
[19:19:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:23:13] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: use stunnel with rsync of lfs data [puppet] - https://gerrit.wikimedia.org/r/1295500 (https://phabricator.wikimedia.org/T412780) (owner: Dzahn)
[19:28:05] <wikibugs>	 (PS1) Arlolra: Bump wikimedia/parsoid to 0.24.0-a7 [vendor] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296002 (https://phabricator.wikimedia.org/T353697)
[19:29:00] <wikibugs>	 (PS1) Arlolra: Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565)
[19:30:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: Arlolra)
[19:34:55] <ottomata>	 jasmine_: yeah since last wednesday, thanks for checking though
[19:36:33] <ottomata>	 it's on all pods, and looks present in codfw too, although it happens less quickly there since there are fewer connections.
[19:36:36] <ottomata>	 i'm going to try to revert
[19:36:53] <ottomata>	 to what we had before last wed. no idea why this would be happening though
[19:40:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1355757424 and 106 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:40:23] <wikibugs>	 (PS1) Ottomata: eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e [deployment-charts] - https://gerrit.wikimedia.org/r/1296004 (https://phabricator.wikimedia.org/T427839)
[19:42:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 52216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:42:38] <wikibugs>	 (CR) Ottomata: [C:+2] eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e [deployment-charts] - https://gerrit.wikimedia.org/r/1296004 (https://phabricator.wikimedia.org/T427839) (owner: Ottomata)
[19:43:40] <wikibugs>	 (CR) Btullis: [C:+1] eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e [deployment-charts] - https://gerrit.wikimedia.org/r/1296004 (https://phabricator.wikimedia.org/T427839) (owner: Ottomata)
[19:45:59] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[19:46:09] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[19:46:56] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[19:47:09] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[19:47:57] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[19:48:29] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[19:54:57] <jinxer-wm>	 FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:59:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:00:05] <jouncebot>	 RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2000)
[20:00:05] <jouncebot>	 sfaci, RoanKattouw, xxb, jdlrobson, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:37] <RoanKattouw>	 I can deploy
[20:01:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295504 (https://phabricator.wikimedia.org/T427562) (owner: Catrope)
[20:01:41] <xxb>	 hii
[20:01:49] <cjming>	 i'm here for Santi's patch - it can ride along with other config patches
[20:03:26] <arlolra>	 o/
[20:05:00] <wikibugs>	 (Merged) jenkins-bot: passwordlessLogin: Don't immediately error out in unsupported browsers [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295504 (https://phabricator.wikimedia.org/T427562) (owner: Catrope)
[20:05:16] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1295504|passwordlessLogin: Don't immediately error out in unsupported browsers (T427562)]]
[20:05:19] <stashbot>	 T427562: Users without passkeys and without passkey support in their browser cannot login - https://phabricator.wikimedia.org/T427562
[20:06:12] <wikibugs>	 (PS1) Catrope: Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692)
[20:07:00] <logmsgbot>	 !log catrope@deploy1003 catrope: Backport for [[gerrit:1295504|passwordlessLogin: Don't immediately error out in unsupported browsers (T427562)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:07:49] <wikibugs>	 (PS2) Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967
[20:08:42] <logmsgbot>	 !log catrope@deploy1003 catrope: Continuing with deployment
[20:08:43] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (owner: Dzahn)
[20:09:10] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[20:12:53] <logmsgbot>	 !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295504|passwordlessLogin: Don't immediately error out in unsupported browsers (T427562)]] (duration: 07m 37s)
[20:12:57] <stashbot>	 T427562: Users without passkeys and without passkey support in their browser cannot login - https://phabricator.wikimedia.org/T427562
[20:13:08] <wikibugs>	 (PS3) Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780)
[20:13:23] <RoanKattouw>	 Next I'll do Santi's patch (cc cjming) together with xxb's patch
[20:13:35] <cjming>	 thanks Roan!!
[20:14:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) (owner: Santiago Faci)
[20:14:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295531 (https://phabricator.wikimedia.org/T427384) (owner: XXBlackburnXx)
[20:15:19] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: Dzahn)
[20:18:12] <wikibugs>	 (Merged) jenkins-bot: Remove `wgTestKitchenExperimentStreamNames` [mediawiki-config] - https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) (owner: Santiago Faci)
[20:18:16] <wikibugs>	 (Merged) jenkins-bot: Enable AbuseFilter block action on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1295531 (https://phabricator.wikimedia.org/T427384) (owner: XXBlackburnXx)
[20:18:30] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1285412|Remove `wgTestKitchenExperimentStreamNames` (T422358)]], [[gerrit:1295531|Enable AbuseFilter block action on nlwiki (T427384)]]
[20:18:35] <stashbot>	 T422358: Deprecate and remove Experiment#setStream(streamName) - https://phabricator.wikimedia.org/T422358
[20:18:35] <stashbot>	 T427384: Enable Abusefilter "block" consequence on nlwiki - https://phabricator.wikimedia.org/T427384
[20:20:15] <logmsgbot>	 !log catrope@deploy1003 sfaci, xxblackburnxx, catrope: Backport for [[gerrit:1285412|Remove `wgTestKitchenExperimentStreamNames` (T422358)]], [[gerrit:1295531|Enable AbuseFilter block action on nlwiki (T427384)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:20:58] <RoanKattouw>	 cjming, xxb: Please test your changes (or tell me they're not really testable, they look like they might not be)
[20:21:30] <cjming>	 mine's a no-op
[20:21:50] <xxb>	 RoanKattouw: looks good on my end
[20:21:54] <xxb>	 thanks :)
[20:22:09] <logmsgbot>	 !log catrope@deploy1003 sfaci, xxblackburnxx, catrope: Continuing with deployment
[20:24:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:26:18] <logmsgbot>	 !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285412|Remove `wgTestKitchenExperimentStreamNames` (T422358)]], [[gerrit:1295531|Enable AbuseFilter block action on nlwiki (T427384)]] (duration: 07m 48s)
[20:26:23] <stashbot>	 T422358: Deprecate and remove Experiment#setStream(streamName) - https://phabricator.wikimedia.org/T422358
[20:26:23] <stashbot>	 T427384: Enable Abusefilter "block" consequence on nlwiki - https://phabricator.wikimedia.org/T427384
[20:28:43] <arlolra>	 Jdlrobson: you around?
[20:29:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296002 (https://phabricator.wikimedia.org/T353697) (owner: Arlolra)
[20:29:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: Arlolra)
[20:29:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[20:34:34] <wikibugs>	 (PS1) Arlolra: Deploy PRV to 5 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851)
[20:36:59] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852 (RKemper) NEW
[20:37:02] <wikibugs>	 (PS1) Ottomata: Revert "eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - https://gerrit.wikimedia.org/r/1296016
[20:37:40] <logmsgbot>	 !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T427852 hw failure
[20:37:45] <stashbot>	 T427852: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852
[20:41:53] <wikibugs>	 (CR) Atsuko: [C:+1] Revert "eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - https://gerrit.wikimedia.org/r/1296016 (owner: Ottomata)
[20:43:27] <wikibugs>	 (PS1) Ottomata: eventstreams - increase memory to 2.5Gi [deployment-charts] - https://gerrit.wikimedia.org/r/1296018 (https://phabricator.wikimedia.org/T427839)
[20:43:34] <wikibugs>	 (CR) Ottomata: [C:+2] Revert "eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - https://gerrit.wikimedia.org/r/1296016 (owner: Ottomata)
[20:45:15] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a7 [vendor] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296002 (https://phabricator.wikimedia.org/T353697) (owner: Arlolra)
[20:45:20] <wikibugs>	 (CR) CI reject: [V:-1] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[20:45:35] <wikibugs>	 (CR) CI reject: [V:-1] Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: Arlolra)
[20:45:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[20:45:54] <wikibugs>	 (Merged) jenkins-bot: Revert "eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - https://gerrit.wikimedia.org/r/1296016 (owner: Ottomata)
[20:46:22] <denisse>	 !incidents
[20:46:22] <sirenbot>	 8038 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[20:46:22] <sirenbot>	 8037 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[20:46:23] <sirenbot>	 8036 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw)
[20:46:23] <sirenbot>	 8035 (RESOLVED)  [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw)
[20:47:24] <wikibugs>	 (CR) Catrope: [C:+2] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[20:47:31] <wikibugs>	 (CR) Catrope: [C:+2] Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: Arlolra)
[20:47:38] <RoanKattouw>	 Looks like a CI issue, retrying
[20:49:29] <wikibugs>	 (CR) Ottomata: [C:+2] eventstreams - increase memory to 2.5Gi [deployment-charts] - https://gerrit.wikimedia.org/r/1296018 (https://phabricator.wikimedia.org/T427839) (owner: Ottomata)
[20:50:20] <wikibugs>	 (CR) CI reject: [V:-1] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[20:51:17] <RoanKattouw>	 ugh now that one is failing because of the Parsoid version mismatch. I'll retry it after the second Parsoid patch lands
[20:51:43] <arlolra>	 :|
[20:51:46] <arlolra>	 sorry about that
[20:51:49] <wikibugs>	 (Merged) jenkins-bot: eventstreams - increase memory to 2.5Gi [deployment-charts] - https://gerrit.wikimedia.org/r/1296018 (https://phabricator.wikimedia.org/T427839) (owner: Ottomata)
[20:52:32] <wikibugs>	 (CR) Cwhite: Configure nginx to log requests in ECS format to syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[20:53:40] <logmsgbot>	 !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[20:54:43] <wikibugs>	 (PS1) Ottomata: Revert "eventstreams - increase memory to 2.5Gi" [deployment-charts] - https://gerrit.wikimedia.org/r/1296019
[20:54:52] <wikibugs>	 (CR) Ottomata: [C:+2] Revert "eventstreams - increase memory to 2.5Gi" [deployment-charts] - https://gerrit.wikimedia.org/r/1296019 (owner: Ottomata)
[20:55:46] <RoanKattouw>	 No that's OK, it's CI's fault for randomly failing
[20:55:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[20:55:55] <RoanKattouw>	 And then I resubmitted the patches in the wrong order
[20:56:18] <wikibugs>	 (CR) Catrope: Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[20:56:21] <wikibugs>	 (CR) Catrope: [C:+2] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[20:57:04] <wikibugs>	 (Merged) jenkins-bot: Revert "eventstreams - increase memory to 2.5Gi" [deployment-charts] - https://gerrit.wikimedia.org/r/1296019 (owner: Ottomata)
[21:00:05] <jouncebot>	 alexsanford, Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2100).
[21:00:21] <wikibugs>	 (CR) Cwhite: [C:+1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: Btullis)
[21:00:40] <wikibugs>	 (CR) Cwhite: [C:+1] "+1 - Errors in this config can cause logging to stop flowing completely." [puppet] - https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[21:01:49] <wikibugs>	 (PS1) Atsuko: eventstreams: set envoy timeout to 0s [deployment-charts] - https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839)
[21:03:04] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: Arlolra)
[21:03:08] <wikibugs>	 (Merged) jenkins-bot: Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: Catrope)
[21:03:45] <wikibugs>	 (CR) Ottomata: [C:+1] eventstreams: set envoy timeout to 0s [deployment-charts] - https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) (owner: Atsuko)
[21:04:20] <maryum>	 preparing to deploy a few security patches
[21:04:21] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1296002|Bump wikimedia/parsoid to 0.24.0-a7 (T353697 T415591 T427565)]], [[gerrit:1296003|Bump wikimedia/parsoid to 0.24.0-a7 (T427565)]], [[gerrit:1296009|Redirect Special:AccountRecovery to the shared domain (T427692)]]
[21:04:23] <RoanKattouw>	 maryum: I'm still deploying, sorry
[21:04:26] <maryum>	 no worries
[21:04:31] <maryum>	 just let me know
[21:04:34] <stashbot>	 T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697
[21:04:34] <stashbot>	 T415591: Template source used in shadow attribute - https://phabricator.wikimedia.org/T415591
[21:04:35] <RoanKattouw>	 CI was being difficult, sorry for going over time
[21:04:35] <stashbot>	 T427565: CTT tasks week of 2026-05-29 - https://phabricator.wikimedia.org/T427565
[21:04:36] <stashbot>	 T427692: Special:AccountRecovery never allows itself to be used - https://phabricator.wikimedia.org/T427692
[21:05:29] <wikibugs>	 (PS1) Jdlrobson: Donor Delight Badge: Add dependency on mw.user [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296022 (https://phabricator.wikimedia.org/T427850)
[21:05:37] <RoanKattouw>	 After my deploy maryum should do the security deploys, and then when that's done I can deploy Jdlrobson's patches if he's available by then
[21:05:51] <maryum>	 sounds good thanks
[21:06:07] <logmsgbot>	 !log catrope@deploy1003 catrope, arlolra: Backport for [[gerrit:1296002|Bump wikimedia/parsoid to 0.24.0-a7 (T353697 T415591 T427565)]], [[gerrit:1296003|Bump wikimedia/parsoid to 0.24.0-a7 (T427565)]], [[gerrit:1296009|Redirect Special:AccountRecovery to the shared domain (T427692)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:06:16] <sbassett>	 We have two reg security patches today, and then one change to PS.php that needs to go out...
[21:07:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 253866936 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:07:46] <logmsgbot>	 !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[21:07:58] <logmsgbot>	 !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[21:08:22] <logmsgbot>	 !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[21:09:05] <RoanKattouw>	 I tested my patch and it works
[21:09:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2870200 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:09:10] <RoanKattouw>	 arlolra: Let me know when you're done testing
[21:09:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:09:15] <arlolra>	 done, lgtm
[21:09:18] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[21:09:26] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[21:09:32] <arlolra>	 RoanKattouw: ^
[21:09:33] <logmsgbot>	 !log catrope@deploy1003 catrope, arlolra: Continuing with deployment
[21:10:18] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[21:12:46] <wikibugs>	 (PS2) Atsuko: eventstreams: set envoy timeout to 0s [deployment-charts] - https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839)
[21:13:41] <logmsgbot>	 !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296002|Bump wikimedia/parsoid to 0.24.0-a7 (T353697 T415591 T427565)]], [[gerrit:1296003|Bump wikimedia/parsoid to 0.24.0-a7 (T427565)]], [[gerrit:1296009|Redirect Special:AccountRecovery to the shared domain (T427692)]] (duration: 09m 20s)
[21:13:48] <stashbot>	 T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697
[21:13:49] <stashbot>	 T415591: Template source used in shadow attribute - https://phabricator.wikimedia.org/T415591
[21:13:49] <stashbot>	 T427565: CTT tasks week of 2026-05-29 - https://phabricator.wikimedia.org/T427565
[21:13:49] <stashbot>	 T427692: Special:AccountRecovery never allows itself to be used - https://phabricator.wikimedia.org/T427692
[21:14:01] <wikibugs>	 (CR) Atsuko: [C:+2] eventstreams: set envoy timeout to 0s [deployment-charts] - https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) (owner: Atsuko)
[21:14:12] <RoanKattouw>	 maryum: All yours, please ping me when you're done
[21:14:40] <maryum>	 yayyyy
[21:16:17] <wikibugs>	 (PS1) Reedy: Add maintenance script to scrape SVG render files [extensions/timeline] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296024
[21:16:24] <wikibugs>	 (Merged) jenkins-bot: eventstreams: set envoy timeout to 0s [deployment-charts] - https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) (owner: Atsuko)
[21:16:41] <arlolra>	 RoanKattouw: thanks!
[21:21:05] <maryum>	 running scap for the first security patch
[21:21:10] <maryum>	 one of two patches to be deployed
[21:21:28] <maryum>	 then sbassett will do a PS.php deploy after that
[21:26:13] <wikibugs>	 (PS1) Jdlrobson: styles: Limit selector to badge client pref [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407)
[21:27:22] <maryum>	 first scap done, preparing to run second scap
[21:27:32] <maryum>	 !log Deployed security fix for T427235
[21:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:15] <wikibugs>	 (PS1) Zabe: maintain-views: Loosen views for filerevision table [puppet] - https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804)
[21:32:59] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[21:33:08] <logmsgbot>	 !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[21:34:36] <wikibugs>	 (CR) Eevans: [C:+2] Add component/cassandra50 for Cassandra 5.0.x releases [puppet] - https://gerrit.wikimedia.org/r/1287923 (https://phabricator.wikimedia.org/T418419) (owner: Eevans)
[21:35:16] <logmsgbot>	 !log atsuko@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[21:35:36] <maryum>	 !log Deployed security fix for T427611
[21:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:45] <maryum>	 sbassett go ahead with PS.php
[21:35:55] <logmsgbot>	 !log atsuko@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[21:36:42] <wikibugs>	 (CR) Zabe: "Yeah we should the task first, but I would still do this prior to stop writing to production since otherwise folks will assume the tables " [puppet] - https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191) (owner: Zabe)
[21:42:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:45:38] <sbassett>	 Deploying update to PS.php now…
[21:50:43] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[21:51:46] <sbassett>	 !log Deployed updated mitigation for T326691
[21:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:17] <JJMC89>	 ottomata and atsukoito: thank you
[21:54:23] <atsukoito>	 JJMC89: my pleasure
[21:56:39] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[21:58:23] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[21:59:10] <wikibugs>	 (PS1) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[21:59:38] <wikibugs>	 (PS2) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[22:00:56] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[22:06:03] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[22:06:26] <wikibugs>	 (CR) CI reject: [V:-1] Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[22:06:49] <sbassett>	 Ok, we should be done with security deployments for today.
[22:07:45] <logmsgbot>	 !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[22:11:30] <wikibugs>	 (PS3) Zabe: maintain-views: Drop image and oldimage tables [puppet] - https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191)
[22:19:18] <wikibugs>	 (CR) VolkerE: [C:+1] styles: Limit selector to badge client pref [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407) (owner: Jdlrobson)
[22:28:17] <wikibugs>	 (CR) Reedy: [C:+2] Add maintenance script to scrape SVG render files [extensions/timeline] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296024 (owner: Reedy)
[22:29:18] <wikibugs>	 (Merged) jenkins-bot: Add maintenance script to scrape SVG render files [extensions/timeline] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296024 (owner: Reedy)
[22:30:08] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1296024|Add maintenance script to scrape SVG render files]]
[22:31:54] <logmsgbot>	 !log reedy@deploy1003 reedy: Backport for [[gerrit:1296024|Add maintenance script to scrape SVG render files]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:32:16] <logmsgbot>	 !log reedy@deploy1003 reedy: Continuing with deployment
[22:36:31] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296024|Add maintenance script to scrape SVG render files]] (duration: 06m 22s)
[22:38:30] <wikibugs>	 (CR) Ladsgroup: [C:+1] "I compare it and it looks correct based on what's on oldimage. I'll deploy it tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804) (owner: Zabe)
[22:45:07] <wikibugs>	 (PS4) Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780)
[22:53:55] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:54:36] <wikibugs>	 (PS8) Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112)
[22:56:18] <wikibugs>	 (CR) Aleksandar Mastilovic: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[22:57:52] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/4 (Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:58:15] <wikibugs>	 (CR) Aleksandar Mastilovic: "I've added hiera values for the test cluster, too." [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[22:58:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:58:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:58:40] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:59:58] <Jdlrobson>	 o/ will need to do some deploys in web team deploy window. Please let me know soonish if there is good reason not to.
[23:00:05] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2300)
[23:00:13] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1295967/8626/gerrit1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: Dzahn)
[23:00:46] <rzl>	 Jdlrobson: nothing from the SRE side, have a good deploy
[23:00:46] <wikibugs>	 (PS3) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:01:00] <wikibugs>	 (PS4) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:01:06] <Jdlrobson>	 thanks rzl 
[23:02:13] <wikibugs>	 (PS5) Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780)
[23:02:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296022 (https://phabricator.wikimedia.org/T427850) (owner: Jdlrobson)
[23:02:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407) (owner: Jdlrobson)
[23:04:38] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6015.*
[23:05:31] <wikibugs>	 ops-drmrs, DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11974498 (BCornwall) Open→Resolved Checked again after a weekend and things seem fine. Repooling and will check on it again to make double-sure but we should be good. Thanks!
[23:05:36] <wikibugs>	 (Merged) jenkins-bot: Donor Delight Badge: Add dependency on mw.user [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296022 (https://phabricator.wikimedia.org/T427850) (owner: Jdlrobson)
[23:05:38] <wikibugs>	 (Merged) jenkins-bot: styles: Limit selector to badge client pref [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407) (owner: Jdlrobson)
[23:05:59] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1296022|Donor Delight Badge: Add dependency on mw.user (T427850)]], [[gerrit:1296028|styles: Limit selector to badge client pref (T427407)]]
[23:06:04] <stashbot>	 T427850: TypeError: Cannot read properties of undefined (reading 'set') - https://phabricator.wikimedia.org/T427850
[23:06:05] <stashbot>	 T427407: Search icon appears on the left on mobile while logged out on certain wikis - https://phabricator.wikimedia.org/T427407
[23:07:24] <wikibugs>	 (PS1) Bvibber: Update name and address for bvibber, drop dead blog from planet [puppet] - https://gerrit.wikimedia.org/r/1296038
[23:07:43] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1296022|Donor Delight Badge: Add dependency on mw.user (T427850)]], [[gerrit:1296028|styles: Limit selector to badge client pref (T427407)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:07:47] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[23:10:41] <wikibugs>	 (CR) RLazarus: [C:+1] "Happy to +2 and deploy this, just LMK if you're ready." [puppet] - https://gerrit.wikimedia.org/r/1296038 (owner: Bvibber)
[23:11:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:11:22] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[23:13:16] <wikibugs>	 (CR) Dzahn: "the goal is for this to be identical to before, removing the custom rsync server config and the "if active_host" around it. the quickdatac" [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: Dzahn)
[23:14:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:14:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:14] <wikibugs>	 (CR) Dzahn: [C:+1] Update name and address for bvibber, drop dead blog from planet [puppet] - https://gerrit.wikimedia.org/r/1296038 (owner: Bvibber)
[23:15:32] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296022|Donor Delight Badge: Add dependency on mw.user (T427850)]], [[gerrit:1296028|styles: Limit selector to badge client pref (T427407)]] (duration: 09m 33s)
[23:15:38] <stashbot>	 T427850: TypeError: Cannot read properties of undefined (reading 'set') - https://phabricator.wikimedia.org/T427850
[23:15:38] <stashbot>	 T427407: Search icon appears on the left on mobile while logged out on certain wikis - https://phabricator.wikimedia.org/T427407
[23:16:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:16:54] <Jdlrobson>	 beginning next set of changes
[23:17:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:17:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: Marco Fossati)
[23:19:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:19:46] <wikibugs>	 (Abandoned) Jdlrobson: styles: Hide donor badge container by default [skins/MinervaNeue] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295990 (https://phabricator.wikimedia.org/T425450) (owner: Jdlrobson)
[23:19:56] <wikibugs>	 (Merged) jenkins-bot: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:20:02] <wikibugs>	 (Merged) jenkins-bot: Carousel: Defer to MobileFrontend lightbox on mobile [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: Marco Fossati)
[23:20:20] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1295963|Make MultimediaViewer compatible with MobileFrontend legacy parser (T427542)]], [[gerrit:1295962|Carousel: Defer to MobileFrontend lightbox on mobile (T427679)]]
[23:20:24] <stashbot>	 T427542: [Image Browsing] Carousel: MMV fails to load when clicking on carousel items that correspond to lazy image placeholders (legacy parser only) - https://phabricator.wikimedia.org/T427542
[23:20:25] <stashbot>	 T427679: [Image Browsing] Carousel: Users should not see desktop MMV experience when clicking an image - https://phabricator.wikimedia.org/T427679
[23:21:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:22:03] <logmsgbot>	 !log jdlrobson@deploy1003 mfossati, jdlrobson: Backport for [[gerrit:1295963|Make MultimediaViewer compatible with MobileFrontend legacy parser (T427542)]], [[gerrit:1295962|Carousel: Defer to MobileFrontend lightbox on mobile (T427679)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:23:28] <logmsgbot>	 !log jdlrobson@deploy1003 mfossati, jdlrobson: Continuing with deployment
[23:25:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:26:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:27:37] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295963|Make MultimediaViewer compatible with MobileFrontend legacy parser (T427542)]], [[gerrit:1295962|Carousel: Defer to MobileFrontend lightbox on mobile (T427679)]] (duration: 07m 17s)
[23:27:41] <stashbot>	 T427542: [Image Browsing] Carousel: MMV fails to load when clicking on carousel items that correspond to lazy image placeholders (legacy parser only) - https://phabricator.wikimedia.org/T427542
[23:27:42] <stashbot>	 T427679: [Image Browsing] Carousel: Users should not see desktop MMV experience when clicking an image - https://phabricator.wikimedia.org/T427679
[23:29:53] <wikibugs>	 (CR) BCornwall: [C:+1] "Code looks good but I'd like @joe to verify that this is the way we want to handle it." [puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: Slyngshede)
[23:30:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852#11974620 (Jclark-ctr) This server is out of warranty @rkemper. but I am looking at it right now
[23:31:44] <wikibugs>	 (CR) BCornwall: "resetting to 0 as I can't find docs on `X-Image-Generator`" [puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: Slyngshede)
[23:34:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:35:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:39:36] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296045
[23:39:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296045 (owner: TrainBranchBot)
[23:40:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:40:35] <Jdlrobson>	 (done)
[23:41:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:42:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11974646 (colewhite)
[23:44:20] <wikibugs>	 (PS1) Scott French: scap.cfg.erb: Temporarily pin mediawiki_runtime_image [puppet] - https://gerrit.wikimedia.org/r/1296036 (https://phabricator.wikimedia.org/T418200)
[23:44:20] <wikibugs>	 (CR) Scott French: "Apparently, I completely forgot that this was statically defined in [0] rather than somehow following what we do in MediaWiki image builds" [puppet] - https://gerrit.wikimedia.org/r/1296036 (https://phabricator.wikimedia.org/T418200) (owner: Scott French)
[23:52:29] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296045 (owner: TrainBranchBot)