[00:03:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:05:25] FIRING: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:55] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:11:50] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:16:50] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:18:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [00:19:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on 
wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:33] !imm [00:19:41] !incidents [00:19:42] 8035 (ACKED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [00:19:42] 8034 (RESOLVED) OutboundMXQueueHigh sre (mx-out1001:9154 eqiad) [00:23:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:24:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:27:38] PROBLEM - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:27:39] ACKNOWLEDGEMENT - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T427748 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:27:51] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748 (10ops-monitoring-bot) 03NEW [00:53:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:58:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:58:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:03:40] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [01:03:42] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [01:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:09:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1295788 [01:09:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1295788 (owner: 10TrainBranchBot) [01:10:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:10:08] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers 
wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:11:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:11:08] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:13:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [01:13:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:18:23] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1295788 (owner: 10TrainBranchBot) [01:20:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... 
[01:20:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:22:02] !incidents [01:22:03] 8036 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [01:22:03] 8035 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [01:22:03] 8034 (RESOLVED) OutboundMXQueueHigh sre (mx-out1001:9154 eqiad) [01:25:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... [01:25:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:36:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... 
[01:36:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:37:07] !ack 8037 [01:37:07] 8037 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [01:55:09] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11970439 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty will check to see what is available from decom servers [01:56:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... [01:56:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [02:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 
(Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:13:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [04:05:40] FIRING: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:19:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:24:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:47:52] PROBLEM - ganeti-noded running on ganeti1028 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded 
https://wikitech.wikimedia.org/wiki/Ganeti [05:48:52] RECOVERY - ganeti-noded running on ganeti1028 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [06:44:56] (03CR) 10Muehlenhoff: [C:03+2] Mark the wikidough ports as intentionally open to the world [puppet] - 10https://gerrit.wikimedia.org/r/1295431 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [06:47:13] (03CR) 10Slyngshede: [C:03+1] "That's a lot of groups" [puppet] - 10https://gerrit.wikimedia.org/r/1295467 (owner: 10Muehlenhoff) [06:47:51] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469#11970599 (10Marostegui) There seem to be way more alerts with this problem {F85870686} [06:47:58] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [06:48:34] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [07:00:05] Amir1, urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T0700). [07:00:05] WMDE-Fisch, atsukoito, and xxb: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:17] \o [07:00:21] nyaa [07:00:30] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11970613 (10MoritzMuehlenhoff) [07:00:51] I'll self serve and start with my stuff [07:01:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by wmde-fisch@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [07:01:17] (03CR) 10Brouberol: [C:03+1] "We will need to redeploy everything using these fields. Could you ping me when this is merged, I'll redeploy all kerberized kubernetes app" [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [07:01:25] (03CR) 10Muehlenhoff: [C:03+2] Switch the pki:root role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294958 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [07:06:01] xxb: I see on the ticket to your patch that there's a comment to wait at least a week before you should come to a conclusion. 🤔 [07:06:59] i mean 16 support 0 oppose. i can wait but fine. [07:07:15] True ;-) [07:07:27] I just saw that as well. 
[07:10:18] (03CR) 10Muehlenhoff: [C:03+2] Switch rpkivalidator role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294930 (owner: 10Muehlenhoff) [07:10:46] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [07:13:01] (03Merged) 10jenkins-bot: Update VE core submodule to master (9cf5524e7) [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [07:13:07] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:13:14] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [07:13:38] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1294826|Update VE core submodule to master (9cf5524e7) (T424232)]] [07:13:41] T424232: VisualDiff does not show change of a main+details edit - https://phabricator.wikimedia.org/T424232 [07:13:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:14:46] (03PS1) 10Brouberol: test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) [07:19:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [07:23:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [07:25:00] Hmmm building the containers takes quite long.... [07:25:33] But I also don't see anything in the logs. [07:26:23] (03CR) 10Fabfur: [C:03+1] cache::haproxy: limit email addresses to reasonable lengths [puppet] - 10https://gerrit.wikimedia.org/r/1240174 (owner: 10Giuseppe Lavagetto) [07:27:14] Ah now it works ^^' [07:27:59] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11970653 (10ayounsi) `--move-vlan` is only made to migrate core DCs from legacy to new per rack vlans. Let me know if its worth spending... [07:28:25] (03CR) 10Fabfur: "Will this requires a general haproxykafka roll-restart?" [puppet] - 10https://gerrit.wikimedia.org/r/1295020 (owner: 10Elukey) [07:28:56] FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:31:22] !log wmde-fisch@deploy1003 wmde-fisch: Backport for [[gerrit:1294826|Update VE core submodule to master (9cf5524e7) (T424232)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:31:26] T424232: VisualDiff does not show change of a main+details edit - https://phabricator.wikimedia.org/T424232 [07:31:37] Testing [07:32:06] !log wmde-fisch@deploy1003 wmde-fisch: Continuing with deployment [07:34:31] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1295368 (https://phabricator.wikimedia.org/T426764) (owner: 10Brouberol) [07:35:27] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: enable sync pods to egress to our s3 endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295368 (https://phabricator.wikimedia.org/T426764) (owner: 10Brouberol) [07:38:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [07:38:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [07:38:56] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:40:48] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:40:48] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:41:39] FIRING: CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad+%28gre%29 - https://alerts.wikimedia.org/?q=alertname%3DCoreB [07:41:57] Deployment is somehow soooo slow today.... 
[07:42:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:42:42] it's monday morning for them too ¯\_(ツ)_/¯ [07:42:56] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:43:31] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11970702 (10MatthewVernon) @VRiley-WMF sure; backends can't be meaningfully depooled, so it'd be a case of "do one, check everything has recovered OK, move on to the next". [not sure if it's ea... [07:44:30] xxb: I fear we won't make your change ... there's this other config patch I need to merge still :-/ [07:44:46] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:45:00] You could put it into the afternoon slot though. 
[07:45:11] sure ill try get it this evening or tomorrow [07:45:13] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294826|Update VE core submodule to master (9cf5524e7) (T424232)]] (duration: 31m 34s) [07:45:16] T424232: VisualDiff does not show change of a main+details edit - https://phabricator.wikimedia.org/T424232 [07:45:18] (03CR) 10Santiago Faci: test-kitchen: reach out to the growthbook-api through the mesh (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [07:45:23] afternoon ill have to do other irl stuff ;/ [07:45:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295454 (https://phabricator.wikimedia.org/T427484) (owner: 10Svantje Lilienthal) [07:46:39] FIRING: [2x] CoreBGPDown: ... [07:46:39] Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad+%28gre%29 - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:47:03] (03PS2) 10Muehlenhoff: mirrors: Disable osbpo sync [puppet] - 10https://gerrit.wikimedia.org/r/1294980 (https://phabricator.wikimedia.org/T416707) [07:47:14] (03Merged) 10jenkins-bot: Disable the creation of synthetic main refs in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295454 (https://phabricator.wikimedia.org/T427484) (owner: 10Svantje Lilienthal) [07:47:31] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1295454|Disable the creation of synthetic main refs in production (T427484)]] [07:47:34] T427484: Disable the creation of synthetic main refs in production - 
https://phabricator.wikimedia.org/T427484 [07:49:43] atsukoito: seems like there won't be enough time left for the ttm config change this morning :/ [07:50:05] jouncebot: next [07:50:06] In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1000) [07:50:40] well if it's OK we could possibly extend the backport window? [07:51:06] dcausse: let's move it further then [07:51:18] tuesday? [07:51:22] !log wmde-fisch@deploy1003 lilients, wmde-fisch: Backport for [[gerrit:1295454|Disable the creation of synthetic main refs in production (T427484)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:51:35] atsukoito: sounds good, tuesday same time [07:51:58] I'll update the patch/page, thanks [07:52:03] thanks! [07:52:34] !log wmde-fisch@deploy1003 lilients, wmde-fisch: Continuing with deployment [07:52:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:54:19] (03CR) 10Muehlenhoff: [C:03+2] Bitu: Switch to idm-sre-approval@wikimedia.org for notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295467 (owner: 10Muehlenhoff) [07:56:13] !log add no_p2p term to pfw1-codfw BGP_fundraising_export - T423384 [07:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:17] T423384: Investigate internal rejected prefixes - https://phabricator.wikimedia.org/T423384 [07:57:58] (03PS2) 10Brouberol: test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) [07:58:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295531 (https://phabricator.wikimedia.org/T427384) (owner: 10XXBlackburnXx) [07:58:58] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295454|Disable the creation of synthetic main refs in production (T427484)]] (duration: 11m 26s) [07:59:01] T427484: Disable the creation of synthetic main refs in production - https://phabricator.wikimedia.org/T427484 [07:59:31] Deployments done. [08:00:11] (03CR) 10Brouberol: test-kitchen: reach out to the growthbook-api through the mesh (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [08:05:40] FIRING: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:20] (03PS1) 10Kosta Harlan: hCaptcha: Raise SiteVerify error threshold to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295802 [08:08:24] (03CR) 10Santiago Faci: [C:03+2] test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [08:11:45] (03Merged) 10jenkins-bot: test-kitchen: reach out to the growthbook-api through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295794 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [08:12:36] 06SRE, 10SRE-swift-storage: tests for wmf/rewrite.py should use standard thumbnail size (and should also work) - https://phabricator.wikimedia.org/T427757 (10MatthewVernon) 03NEW [08:13:31] (03CR) 10Urbanecm: [C:03+1] "No objection, although...did it really not work the whole time?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [08:13:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:13:48] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:16:39] RESOLVED: [2x] CoreBGPDown: ... [08:16:39] Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad+%28gre%29 - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:17:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:18:42] (03PS1) 10MVernon: rewrite_integration: use a standard thumbnail size [puppet] - 10https://gerrit.wikimedia.org/r/1295804 (https://phabricator.wikimedia.org/T427757) [08:20:20] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: tests for wmf/rewrite.py should use standard thumbnail size (and should also work) - https://phabricator.wikimedia.org/T427757#11970801 (10MatthewVernon) [08:20:24] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11970802 (10MatthewVernon) [08:24:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:24:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:24:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93405 and previous config saved to /var/cache/conftool/dbconfig/20260601-082442-fceratto.json [08:31:31] (03PS1) 10Ayounsi: Add RejectingBGPPrefixes alert [alerts] - 10https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384) [08:31:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93406 and previous config saved to /var/cache/conftool/dbconfig/20260601-083146-fceratto.json [08:32:08] (03CR) 10JMeybohm: ratelimit-media: policy and user-class level metrics (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [08:33:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 380533912 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:33:55] (03CR) 10CI reject: [V:04-1] Add RejectingBGPPrefixes alert [alerts] - 10https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [08:34:38] (03PS1) 10Muehlenhoff: node20-slim: Fix image build [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295847 [08:34:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2744880 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:35:14] (03CR) 10JMeybohm: [C:03+1] kafka-main2010: apply host-level override in advance of trixie upgrade [0] [puppet] - 
10https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [08:35:25] (03PS2) 10Ayounsi: Add RejectingBGPPrefixes alert [alerts] - 10https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384) [08:36:39] (03CR) 10JMeybohm: [C:03+1] "Two nits but feel free to ignore" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [08:36:40] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11970923 (10mahmoud.abdelsattar.wmde) Dear @Dzahn .. I've confirmed the SSH key with my email. Thanks a lot! [08:40:46] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [08:41:49] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295430 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [08:41:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P93407 and previous config saved to /var/cache/conftool/dbconfig/20260601-084154-fceratto.json [08:43:14] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [08:50:07] 06SRE, 06Infrastructure-Foundations, 10netops: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux - https://phabricator.wikimedia.org/T423430#11971143 (10ayounsi) Once this is fixed we can remove `|ibgp` from the [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/1295805 | RejectingBGPPrefixes... 
[08:52:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P93408 and previous config saved to /var/cache/conftool/dbconfig/20260601-085202-fceratto.json [08:52:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:52:26] (03PS1) 10Dpogorzelski: ml-serve: update eqiad kserve/knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295850 [08:52:31] (03CR) 10FNegri: "Is T426804 a blocker for this?" [puppet] - 10https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191) (owner: 10Zabe) [08:57:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:59:55] (03CR) 10CWilliams: [C:03+1] sre.mysql.pool: Support depooling unreachable hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) (owner: 10Federico Ceratto) [09:02:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93409 and previous config saved to /var/cache/conftool/dbconfig/20260601-090209-fceratto.json [09:02:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [09:02:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T426633)', diff saved to 
https://phabricator.wikimedia.org/P93410 and previous config saved to /var/cache/conftool/dbconfig/20260601-090237-fceratto.json [09:04:37] (03CR) 10CWilliams: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [09:05:07] (03PS1) 10Jelto: miscweb: update wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295852 (https://phabricator.wikimedia.org/T414405) [09:08:46] (03CR) 10Jelto: [C:03+2] miscweb: update wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295852 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [09:09:13] (03PS3) 10Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs [puppet] - 10https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) [09:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:11:31] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11971248 (10atsuko) 05In progress→03Resolved a:03atsuko Needed to create kerberos principal that matches the uni... 
[09:11:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:11:57] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1177: Upgrading db1177.eqiad.wmnet [09:12:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1177: Upgrading db1177.eqiad.wmnet [09:12:50] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [09:13:22] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [09:14:25] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [09:15:22] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [09:16:45] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.pool: Support depooling unreachable hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) (owner: 10Federico Ceratto) [09:16:46] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake for Dlulisa-WMF - https://phabricator.wikimedia.org/T427197#11971265 (10atsuko) Could you please re-check that you have the access to the tables if you do `kinit wmf-ldlulisa`. 
[09:17:55] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1177.eqiad.wmnet with OS trixie [09:18:27] (03CR) 10Federico Ceratto: [C:03+1] "LGTM, just reviewing the description and checking the CI ran successfully" [puppet] - 10https://gerrit.wikimedia.org/r/1295804 (https://phabricator.wikimedia.org/T427757) (owner: 10MVernon) [09:20:48] (03Merged) 10jenkins-bot: sre.mysql.pool: Support depooling unreachable hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) (owner: 10Federico Ceratto) [09:24:42] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Raise SiteVerify error threshold to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295802 (owner: 10Kosta Harlan) [09:28:49] (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1295859 (https://phabricator.wikimedia.org/T427345) [09:29:29] (03PS1) 10Atsuko: flink: updating control to jdk21 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774) [09:29:51] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog: Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11971334 (10MLechvien-WMF) p:05Medium→03Low [09:31:02] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage [09:31:08] (03CR) 10Brouberol: [C:03+1] flink: updating control to jdk21 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774) (owner: 10Atsuko) [09:31:58] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake for Dlulisa-WMF - https://phabricator.wikimedia.org/T427197#11971338 (10atsuko) 05In progress→03Invalid Confirmed that the access is already 
present, no change needed. [09:33:16] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1295859 (https://phabricator.wikimedia.org/T427345) (owner: 10Marostegui) [09:34:25] 10SRE-tools, 06DBA, 10Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11971342 (10Marostegui) @elukey any input on this? [09:34:29] (03CR) 10JavierMonton: [C:03+1] flink: updating control to jdk21 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774) (owner: 10Atsuko) [09:34:34] (03CR) 10MVernon: [C:03+2] rewrite_integration: use a standard thumbnail size [puppet] - 10https://gerrit.wikimedia.org/r/1295804 (https://phabricator.wikimedia.org/T427757) (owner: 10MVernon) [09:34:34] 10SRE-tools, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11971346 (10Marostegui) p:05Triage→03Medium [09:34:40] (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1295859 (https://phabricator.wikimedia.org/T427345) (owner: 10Marostegui) [09:34:43] (03CR) 10Atsuko: [V:03+2 C:03+2] flink: updating control to jdk21 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295860 (https://phabricator.wikimedia.org/T427774) (owner: 10Atsuko) [09:35:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage [09:35:27] 10SRE-tools, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11971353 (10Marostegui) [09:37:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:37:59] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: tests for 
wmf/rewrite.py should use standard thumbnail size (and should also work) - https://phabricator.wikimedia.org/T427757#11971355 (10MatthewVernon) 05Open→03Resolved ` mvernon@ms-fe1009:~$ python3 /usr/local/lib/python3.9/dist-packages/wmf/rewrite... [09:38:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1055: Upgrading es1055.eqiad.wmnet [09:38:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1055: Upgrading es1055.eqiad.wmnet [09:39:11] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1055.eqiad.wmnet with OS trixie [09:39:19] (03CR) 10JMeybohm: [C:04-1] profile::kafka: remove kafka_11 profile occurrences (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1295022 (owner: 10Elukey) [09:40:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:41:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:42:53] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295861 [09:45:25] RESOLVED: SystemdUnitFailed: kernel-purge.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:44] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295861 (owner: 10Muehlenhoff) [09:50:34] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [09:51:21] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [09:51:39] !log cwilliams@cumin1003 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host db1177.eqiad.wmnet with OS trixie [09:53:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1055.eqiad.wmnet with reason: host reimage [09:54:40] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [09:56:05] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [09:58:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1055.eqiad.wmnet with reason: host reimage [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1000) [10:00:29] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1177: Migration of db1177.eqiad.wmnet completed [10:02:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93414 and previous config saved to /var/cache/conftool/dbconfig/20260601-100252-fceratto.json [10:03:20] (03CR) 10Majavah: [C:03+2] firewall::client: Fix default for qos [puppet] - 10https://gerrit.wikimedia.org/r/1294948 (owner: 10Majavah) [10:07:27] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [10:08:23] (03CR) 10Marostegui: [C:03+1] "It is not used." 
[puppet] - 10https://gerrit.wikimedia.org/r/1294248 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [10:09:57] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [10:13:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93415 and previous config saved to /var/cache/conftool/dbconfig/20260601-101300-fceratto.json [10:13:56] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 43591296 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:14:56] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 154872 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:15:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1055.eqiad.wmnet with OS trixie [10:15:57] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [10:16:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1055: repool after upgrade [10:16:55] (03CR) 10JMeybohm: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [10:21:12] (03PS1) 10Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1295866 [10:23:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93418 and previous config saved to /var/cache/conftool/dbconfig/20260601-102308-fceratto.json [10:25:09] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - 
https://phabricator.wikimedia.org/T427393#11971500 (10cmooney) >>! In T427393#11970653, @ayounsi wrote: > `--move-vlan` is only made to migrate core DCs from legacy to new per rac... [10:25:43] (03CR) 10Btullis: [C:03+2] Create a new role for the dse-k8s nodes that are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [10:31:51] (03CR) 10Muehlenhoff: [C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1295866 (owner: 10Muehlenhoff) [10:32:11] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1295866 (owner: 10Muehlenhoff) [10:33:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93421 and previous config saved to /var/cache/conftool/dbconfig/20260601-103316-fceratto.json [10:34:00] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2003.codfw.wmnet [10:35:40] (03CR) 10Muehlenhoff: [C:03+2] dbproxy: Remove unused public type [puppet] - 10https://gerrit.wikimedia.org/r/1294248 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [10:36:02] (03PS2) 10Muehlenhoff: profile::mariadb::proxy: Use Puppet types [puppet] - 10https://gerrit.wikimedia.org/r/1294258 [10:36:55] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11971558 (10MoritzMuehlenhoff) [10:39:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294258 (owner: 10Muehlenhoff) [10:40:18] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2003.codfw.wmnet [10:45:19] jouncebot: nowandnext [10:45:19] For the next 0 
hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1000) [10:45:19] In 2 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1300) [10:45:26] Anyone mind me deploying? [10:45:59] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1177: Migration of db1177.eqiad.wmnet completed [10:46:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:47:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es2055 to es1 codfw primary T427032', diff saved to https://phabricator.wikimedia.org/P93424 and previous config saved to /var/cache/conftool/dbconfig/20260601-104739-marostegui.json [10:47:43] T427032: Migrate es1 section to Debian Trixie - https://phabricator.wikimedia.org/T427032 [10:48:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1050 to es1 eqiad primary T427032', diff saved to https://phabricator.wikimedia.org/P93425 and previous config saved to /var/cache/conftool/dbconfig/20260601-104837-marostegui.json [10:49:16] (03PS1) 10Marostegui: wmnet: Update es1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1295872 (https://phabricator.wikimedia.org/T427032) [10:51:13] (03CR) 10Majavah: [C:04-1] designate: remove leftover mcrouter code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) (owner: 10Andrew Bogott) [10:52:04] (03CR) 10Marostegui: [C:03+2] wmnet: Update es1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1295872 (https://phabricator.wikimedia.org/T427032) (owner: 10Marostegui) [10:52:25] !log marostegui@dns1004 START - running authdns-update [10:54:09] !log marostegui@dns1004 END - running authdns-update [10:56:03] testing deployment... 
[10:57:46] (03PS1) 10Muehlenhoff: thumbor: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295873 [11:00:19] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295873 (owner: 10Muehlenhoff) [11:01:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:01:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93427 and previous config saved to /var/cache/conftool/dbconfig/20260601-110121-fceratto.json [11:01:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1055: repool after upgrade [11:04:58] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [11:06:12] (03PS1) 10Btullis: Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) [11:08:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93429 and previous config saved to /var/cache/conftool/dbconfig/20260601-110820-fceratto.json [11:08:38] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [11:10:03] (03CR) 10Marostegui: [C:03+1] profile::mariadb::proxy: Use Puppet types [puppet] - 10https://gerrit.wikimedia.org/r/1294258 (owner: 10Muehlenhoff) [11:10:58] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:11:06] (03CR) 10Kamila Součková: [C:03+1] api-gateway: Pre-teardown deprecation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294957 (https://phabricator.wikimedia.org/T426881) (owner: 10Clément Goubert) [11:11:34] (03CR) 10Atsuko: [C:03+1] Fix the creation of the 
vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [11:12:09] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:13:55] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971652 (10VRiley-WMF) 05Open→03In progress [11:14:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971653 (10VRiley-WMF) Updating BIOS [11:14:22] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:14:31] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql: Auto-lint imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1293666 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:14:32] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: increase downtime for backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1295440 (https://phabricator.wikimedia.org/T427614) (owner: 10Jelto) [11:15:27] (03PS2) 10Btullis: Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) [11:16:11] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [11:16:43] (03CR) 10Kamila Součková: "LGTM, I'll merge when ready to proceed. Thank you!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295044 (owner: 10Scott French) [11:17:08] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:18:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93430 and previous config saved to /var/cache/conftool/dbconfig/20260601-111827-fceratto.json [11:19:11] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: increase downtime for backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1295440 (https://phabricator.wikimedia.org/T427614) (owner: 10Jelto) [11:19:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:00] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:21:09] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:22:37] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:22:45] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:22:51] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:22:53] !log installing imagemagick security updates [11:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:58] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971669 (10VRiley-WMF) BIOS is now at 1.21.1 (previous was 1.12.1). 
Moving onto iDRAC [11:23:43] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:23:48] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:24:31] (03CR) 10Kamila Součková: [C:03+1] "Thank you <3" [puppet] - 10https://gerrit.wikimedia.org/r/1295057 (owner: 10Scott French) [11:24:39] (03PS1) 10Mszwarc: Add SetGlobalPreference maintenance script [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476) [11:25:16] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:28:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P93432 and previous config saved to /var/cache/conftool/dbconfig/20260601-112835-fceratto.json [11:28:45] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:28:50] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:29:02] (03CR) 10Kamila Součková: [C:03+1] "Yup, /bin/nodejs is in the package file list. Thank you!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295847 (owner: 10Muehlenhoff) [11:29:49] FIRING: HelmReleaseBadStatus: Helm release wdqs-internal/main on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:32:09] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:32:39] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:32:49] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:33:25] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:33:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:34:42] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:34:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs-internal/main on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:36:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:37:50] !log installing Exim security updates [11:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971696 (10VRiley-WMF) iDRAC has been completed. 
moving onto Non-expander storage backplane [11:38:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93433 and previous config saved to /var/cache/conftool/dbconfig/20260601-113843-fceratto.json [11:39:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [11:39:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93434 and previous config saved to /var/cache/conftool/dbconfig/20260601-113911-fceratto.json [11:46:56] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11971700 (10VRiley-WMF) Firmware (BIOS, iDRAC and Non-expander storage backplane) have been updated (I thought they were up to date before, but new information was pointed out to me). Through iDRAC I can... [11:49:46] (03CR) 10Ladsgroup: "yeah probably but also I rather wait we stop writing to the old tables in production since that's going to give users a bit more time." [puppet] - 10https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191) (owner: 10Zabe) [11:50:14] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: update eqiad kserve/knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295850 (owner: 10Dpogorzelski) [11:52:19] (03CR) 10Ladsgroup: [C:03+1] profile::mariadb::proxy: Use Puppet types [puppet] - 10https://gerrit.wikimedia.org/r/1294258 (owner: 10Muehlenhoff) [11:55:58] (03CR) 10Ladsgroup: "We could also enable it for ten minutes and then revert it. 
It's not as great as enabling it gradually but it could work for most purposes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [11:59:30] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/ml-serve-eqiad: maintenance [11:59:30] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/ml-serve-eqiad: maintenance [12:02:09] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:03:09] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:04:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [12:04:39] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2027.codfw.wmnet [12:04:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [12:05:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:05:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [12:06:09] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, 
wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:07:09] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:07:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2005.codfw.wmnet to drbd [12:09:09] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:11:28] !log dpogorzelski@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=inference,name=eqiad [12:13:57] (03PS1) 10Bartosz Wójtowicz: ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295880 [12:15:15] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in eqiad/ml-serve-eqiad: maintenance [12:15:50] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in eqiad/ml-serve-eqiad: maintenance [12:16:41] (03CR) 10Ozge: [C:03+1] ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295880 (owner: 10Bartosz Wójtowicz) [12:17:38] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295880 (owner: 10Bartosz Wójtowicz) [12:17:54] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[12:18:51] (03CR) 10Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [12:18:57] (03PS2) 10Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) [12:19:57] (03Merged) 10jenkins-bot: ml-services: Update recommendation-api-ng memory limit to 2Gi. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295880 (owner: 10Bartosz Wójtowicz) [12:20:12] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:21:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476) (owner: 10Mszwarc) [12:22:13] (03CR) 10JMeybohm: [C:04-1] ratelimit: Add CACHE_KEY_PREFIX configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [12:22:28] (03PS3) 10Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) [12:22:40] (03CR) 10Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [12:23:29] (03PS4) 10Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) [12:23:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing 
disk type of kubestagemaster2005.codfw.wmnet to drbd [12:24:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:12] FIRING: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:26] (03PS3) 10Btullis: Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) [12:26:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [12:26:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [12:26:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2005.codfw.wmnet to plain [12:27:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2005.codfw.wmnet to plain [12:27:51] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:28:18] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] node20-slim: Fix image build [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295847 (owner: 10Muehlenhoff) [12:28:31] (03PS5) 10Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) [12:28:36] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:28:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [12:29:20] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:30:12] RESOLVED: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:02] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295888 [12:35:07] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:35:44] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:39:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93435 and previous config saved to /var/cache/conftool/dbconfig/20260601-123926-fceratto.json [12:41:49] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[12:42:10] (03CR) 10Clément Goubert: ratelimit-media: policy and user-class level metrics (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [12:42:13] (03PS2) 10Clément Goubert: ratelimit-media: policy and user-class level metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) [12:42:44] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:43:25] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:44:02] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:44:25] (03PS1) 10Bartosz Wójtowicz: ml-services: Update outlink-topic-model docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493) [12:46:08] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T415109#11971857 (10VRiley-WMF) @Papaul out of curiosity, should we still be keeping this ticket open? Or is it safe to close out now? [12:46:54] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:47:07] (03PS1) 10Atsuko: service: services_proxy: prod opensearch-on-k8s services [puppet] - 10https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) [12:47:31] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[12:47:36] (03CR) 10Majavah: [C:03+1] toolforge::elasticsearch::haproxy: Restrict to cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1295398 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [12:47:39] (03PS4) 10Giuseppe Lavagetto: cache::haproxy: support wikilink style usernames in UAs [puppet] - 10https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) [12:47:39] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: remove the lua_contact_info feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1295902 (https://phabricator.wikimedia.org/T414300) [12:48:39] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:49:28] (03CR) 10Majavah: [C:04-1] "This profile is used to configure the Cloud VPS outbound email relays (`mx-out*.cloudinfra.eqiad1.wikimedia.cloud`) which need to accept o" [puppet] - 10https://gerrit.wikimedia.org/r/1284671 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [12:49:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93436 and previous config saved to /var/cache/conftool/dbconfig/20260601-124934-fceratto.json [12:49:35] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:50:54] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:52:24] (03CR) 10Bartosz Dziewoński: "Hmm, on a closer look, it worked *some* of the time. It doesn't work today, and it didn't work when it was added in October 2024 (change 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [12:52:25] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[12:53:32] (03CR) 10Majavah: [C:03+1] Revert "labswiki: Disallow account autocreation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [12:55:16] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: support wikilink style usernames in UAs [puppet] - 10https://gerrit.wikimedia.org/r/1276396 (https://phabricator.wikimedia.org/T423992) (owner: 10Giuseppe Lavagetto) [12:55:27] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [12:55:30] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:55:34] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [12:55:38] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [12:55:41] (03PS3) 10Bartosz Dziewoński: Revert "labswiki: Disallow account autocreation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 [12:55:41] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:55:41] (03CR) 10AikoChou: [C:03+1] ml-services: Update outlink-topic-model docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [12:55:44] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:55:46] (03CR) 10Atsuko: [C:03+2] service: move eventstreams-internal to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:55:47] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:55:52] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:55:55] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [12:55:58] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:56:01] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [12:56:04] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [12:56:06] (03CR) 10Bartosz Dziewoński: "I think this commit message explains the situation better, thanks for prompting me to investigate." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [12:56:07] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:56:10] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:56:13] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:56:16] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:56:19] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[12:56:19] !log dpogorzelski@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=inference,name=eqiad [12:57:32] (03CR) 10Atsuko: [C:03+2] "merging with fabfur" [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:58:51] (03PS1) 10Majavah: P:wmcs::kubeadm::etcd: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295905 (https://phabricator.wikimedia.org/T427799) [12:58:58] (03PS2) 10Majavah: P:wmcs::kubeadm::etcd: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295905 (https://phabricator.wikimedia.org/T427799) [12:59:40] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [12:59:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93437 and previous config saved to /var/cache/conftool/dbconfig/20260601-125941-fceratto.json [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1300) [13:00:05] codenamenoreste and Msz2001: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:22] I can’t deploy, in a meeting [13:00:30] I can deploy the patches. codenamenoreste: shall I start with yours? 
[13:00:57] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8621/co" [puppet] - 10https://gerrit.wikimedia.org/r/1295905 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [13:01:00] _joe_: there's outstanding puppet change, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1276396 can I puppetmerge? [13:01:01] yes, it's a patch to allow the visual editor in the project namespace for Swahili Wikipedia [13:01:14] <_joe_> atsukoito: yeah I was about to merge [13:01:23] <_joe_> was waiting for the puppet disable round to finish [13:01:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295536 (https://phabricator.wikimedia.org/T427117) (owner: 10Codename Noreste) [13:01:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:01:36] <_joe_> I'll merge both our changes [13:01:40] but there's something wrong with my home’s wifi, so that's why I can't use my laptop nor WikimediaDebug [13:01:45] _joe_: thanks [13:01:55] <_joe_> atsukoito: merging, will let you know once it's done [13:02:14] codenamenoreste: Okay, I can verify whether the patch works after it gets to the test server [13:02:22] (03Merged) 10jenkins-bot: swwiki: Enable the Visual Editor on the project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295536 (https://phabricator.wikimedia.org/T427117) (owner: 10Codename Noreste) [13:02:36] wait, my wifi works, I'll use WikimediaDebug [13:02:40] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1295536|swwiki: Enable the Visual Editor on the project namespace (T427117)]] [13:02:43] <_joe_> atsukoito: done [13:02:44] T427117: 
Enable VisualEditor in the Project namespace for Swahili Wikipedia (swwiki) - https://phabricator.wikimedia.org/T427117 [13:02:48] _joe_: thanks [13:03:15] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:03:18] (03PS1) 10Majavah: P:toolforge::redis_sentinel: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295908 [13:03:55] ack [13:04:02] wait what? [13:04:11] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:04:21] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:04:26] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8622/co" [puppet] - 10https://gerrit.wikimedia.org/r/1295908 (owner: 10Majavah) [13:04:31] !log mszwarc@deploy1003 codenamenoreste, mszwarc: Backport for [[gerrit:1295536|swwiki: Enable the Visual Editor on the project namespace (T427117)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:04:50] That was so quick :o [13:05:06] (03PS1) 10Kosta Harlan: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) [13:05:08] (03PS1) 10Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) [13:05:19] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:05:21] codenamenoreste: Are you able to verify the patch or should I? [13:05:40] WikimediaDebug is turned on my laptop [13:06:42] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[13:07:05] So you can verify the patch now, then :) [13:07:06] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:07:13] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:08:00] (03PS1) 10Majavah: P:toolforge:legacy_redirector: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804) [13:08:13] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:08:25] (03PS2) 10Majavah: P:toolforge::redis_sentinel: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295908 (https://phabricator.wikimedia.org/T427799) [13:08:34] !log mszwarc@deploy1003 codenamenoreste, mszwarc: Continuing with deployment [13:08:40] Verified myself [13:08:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8623/console" [puppet] - 10https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804) (owner: 10Majavah) [13:09:07] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal [13:09:07] I also checked too and I can verify that it works [13:09:07] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal [13:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:09:25] (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476) (owner: 10Mszwarc) [13:09:29] (03CR) 10Btullis: [C:03+2] Fix the creation of the vg_raid0 volume on dse-k8s-wdqs-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295874 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [13:09:40] (03PS2) 10Kosta Harlan: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) [13:09:46] (03PS2) 10Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) [13:09:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93438 and previous config saved to /var/cache/conftool/dbconfig/20260601-130949-fceratto.json [13:10:09] (03PS1) 10Majavah: P:elasticsearch: Migrate inter-node traffic to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799) [13:10:13] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:10:13] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:10:31] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8624/co" [puppet] - 10https://gerrit.wikimedia.org/r/1295902 
(https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [13:10:58] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8625/co" [puppet] - 10https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [13:10:59] (03Merged) 10jenkins-bot: Add SetGlobalPreference maintenance script [extensions/GlobalPreferences] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295875 (https://phabricator.wikimedia.org/T427476) (owner: 10Mszwarc) [13:11:25] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal [13:11:25] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal atsuko Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410 https://wikitech.wikimedia.org/wiki/PyBal [13:12:02] should sfs-block-bypass be removed from the IP block exemption user group? the StopForumSpam extension was removed [13:12:46] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295536|swwiki: Enable the Visual Editor on the project namespace (T427117)]] (duration: 10m 06s) [13:12:50] T427117: Enable VisualEditor in the Project namespace for Swahili Wikipedia (swwiki) - https://phabricator.wikimedia.org/T427117 [13:12:58] (03CR) 10Majavah: [V:03+1] "Seemingly we're the only users of profile::elasticsearch, with everyone else having moved to profile::opensearch::server." [puppet] - 10https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [13:12:59] Seems like it can be removed. 
I think I have seen a task for it somewhere [13:14:11] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1295875|Add SetGlobalPreference maintenance script (T427476)]] [13:14:15] T427476: Add a maintenance script to set global preferences for listed users - https://phabricator.wikimedia.org/T427476 [13:14:16] !log sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service' [13:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:26] (03CR) 10Kosta Harlan: [C:04-2] "Needs to wait for the /static directory to be populated on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [13:15:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:15:55] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1295875|Add SetGlobalPreference maintenance script (T427476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:16:09] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:16:25] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [13:18:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs-test1001.eqiad.wmnet [13:18:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1265.eqiad.wmnet with OS trixie [13:19:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs-test2001.codfw.wmnet [13:20:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:20:34] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295875|Add SetGlobalPreference maintenance script (T427476)]] (duration: 06m 22s) [13:20:38] T427476: Add a maintenance script to set global preferences for listed users - https://phabricator.wikimedia.org/T427476 [13:20:54] !log UTC afternoon backport+config window done [13:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:57] !log restarted pybal.service on lvs1020 [13:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:09] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:21:58] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/ml-serve-eqiad: maintenance [13:22:22] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/ml-serve-eqiad: maintenance [13:22:25] !log restarted pybal.service on lvs1019 [13:22:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:29] (03PS2) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) [13:23:31] (03PS2) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) [13:23:33] (03PS1) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) [13:24:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs-test1001.eqiad.wmnet [13:24:25] (03CR) 10CI reject: [V:04-1] Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [13:24:47] (03PS1) 10Codename Noreste: Remove sfsblock-bypass from the IP block exemption user group on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745) [13:24:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs-test2001.codfw.wmnet [13:25:36] (03CR) 10CI reject: [V:04-1] Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [13:26:05] (03PS1) 10Ayounsi: Add InterfaceNoDescription alert [alerts] - 10https://gerrit.wikimedia.org/r/1295919 (https://phabricator.wikimedia.org/T419298) [13:26:21] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11972052 (10Jclark-ctr) Errors are on sdb, which has failed in the md1 array. Matching serials, according to iDRAC it is in slot 4 ` [Mon Jun 1 12:30:36 2026] I/O error, dev sdb, sector 3750748677 op 
0x0:... [13:26:28] (03CR) 10CI reject: [V:04-1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [13:27:24] I'm on my laptop, and I have another patch to review and deploy: 1295918 [13:27:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:30:09] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:30:22] (03CR) 10Jelto: "_IF_ we change the SSH config I'd prefer using a dedicated hostname and port 22 instead of changing the port to 2222 and using the TCP pro" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [13:31:09] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:31:09] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[13:31:12] (03PS1) 10Slyngshede: P:cache:haproxy add image generator information [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) [13:31:16] !log restarted pybal.service on lvs2014 [13:31:17] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] cache::haproxy: remove the lua_contact_info feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1295902 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [13:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1265.eqiad.wmnet with reason: host reimage [13:32:20] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11972086 (10FCeratto-WMF) a:05VRiley-WMF→03FCeratto-WMF Thanks @VRiley-WMF journald is not showing hardware errors. MariaDB started cleanly, replication is catching up as expected. https://grafana.wi... 
[13:35:28] (03PS2) 10Slyngshede: P:cache:haproxy add image generator information [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) [13:35:41] !log restarted pybal.service on lvs2013 [13:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:48] (03PS3) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) [13:35:48] (03PS3) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) [13:35:48] (03PS2) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) [13:36:13] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [13:36:15] (03PS1) 10Btullis: Declare the webrequest.dumps.dev0 stream in EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645) [13:36:20] (03PS4) 10Ahmon Dancy: scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) [13:37:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1265.eqiad.wmnet with reason: host reimage [13:38:57] (03PS2) 10Bartosz Dziewoński: Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 [13:38:57] (03CR) 10Bartosz Dziewoński: "I reviewed https://codesearch.wmcloud.org/deployed/?q=writeapi and https://global-search.toolforge.org/?q=writeapi&namespaces=2%2C4%2C8&ti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [13:39:02] (03PS3) 10Bartosz 
Dziewoński: Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 [13:39:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:39:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93439 and previous config saved to /var/cache/conftool/dbconfig/20260601-133947-fceratto.json [13:39:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [13:40:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [13:40:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745) (owner: 10Codename Noreste) [13:41:12] (03CR) 10Ssingh: [C:03+2] scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [13:41:14] one patch should be deployed right now [13:42:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:32] (03CR) 10JMeybohm: [C:03+1] ratelimit-media: policy and user-class level metrics 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [13:43:38] (03CR) 10Ssingh: "Thanks for confirming @jwodstrcil@wikimedia.org that this doesn't break the Gitlab workflow/experience!" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [13:48:45] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] "I kicked off a manual rebuild of the image and it now worked fine:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295847 (owner: 10Muehlenhoff) [13:50:12] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:51:09] I have T427745 to resolve, and I'm waiting [13:51:10] T427745: Remove sfsblock-bypass from ipblock-exempt group - https://phabricator.wikimedia.org/T427745 [13:51:38] (03PS1) 10Jcrespo: dbbackups: Reenable read-only ES backups [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) [13:52:11] (03CR) 10Jforrester: "Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295847 (owner: 10Muehlenhoff) [13:52:12] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:52:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1265.eqiad.wmnet with OS trixie [13:52:50] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:53:12] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:54:39] is there any deployer available? [13:54:48] (03CR) 10Jcrespo: [C:04-2] "Do not merge until tonight's rw run to avoid conflicts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [13:55:12] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:55:15] jouncebot: nowandnext [13:55:16] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1300) [13:55:16] In 0 hour(s) and 34 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1430) [13:55:21] I guess I can deploy it now [13:56:58] ^ https://phabricator.wikimedia.org/T427745 [13:57:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745) (owner: 10Codename Noreste) [13:58:03] (03CR) 10Zabe: "I could also do something horrible like canceling the sync after it reached the canaries and let it sit there for a few minutes and see wha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [13:58:36] (03Merged) 10jenkins-bot: Remove sfsblock-bypass from the IP block exemption user group on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295918 (https://phabricator.wikimedia.org/T427745) (owner: 10Codename Noreste) [13:58:53] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1295918|Remove sfsblock-bypass from the IP block exemption user group on all wikis (T427745)]] [13:58:56] T427745: Remove sfsblock-bypass from ipblock-exempt group - https://phabricator.wikimedia.org/T427745 [14:00:36] (03CR) 10Atsuko: [C:03+2] "applied" [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:01:12] (03CR) 10Ladsgroup: "it 
wouldn't even make it to the list of top ten horrible things we have done!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [14:01:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:01:56] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:sessionstore [14:02:07] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:02:34] (03PS4) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) [14:02:38] (03PS3) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) [14:03:23] Lucas_WMDE when I used WikimediaDebug the sfsblock-bypass right is no longer there (from the test servers) [14:03:31] Should be okay to deploy [14:03:35] … [14:03:39] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, codenamenoreste: Backport for [[gerrit:1295918|Remove sfsblock-bypass from the IP block exemption user group on all wikis (T427745)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:03:45] codenamenoreste: please test *now* [14:04:11] (03CR) 10Btullis: Configure nginx to log requests in ECS format to syslog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:04:15] testing early just makes everything more confusing because I don’t know which server with which version you hit [14:05:07] FIRING: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:05:11] it works! [14:05:47] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, codenamenoreste: Continuing with deployment [14:05:50] alright, thanks! [14:06:10] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [14:06:12] (03PS2) 10Ayounsi: Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 [14:08:29] (03CR) 10Brouberol: [C:03+1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [14:09:59] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295918|Remove sfsblock-bypass from the IP block exemption user group on all wikis (T427745)]] (duration: 11m 06s) [14:10:00] (03CR) 10Ssingh: Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [14:10:03] T427745: Remove sfsblock-bypass from ipblock-exempt group - https://phabricator.wikimedia.org/T427745 [14:10:07] RESOLVED: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:10:08] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:11:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:11:08] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:11:34] jouncebot: nowandnext [14:11:34] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [14:11:35] In 0 hour(s) and 18 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1430) [14:11:45] (03CR) 10Brouberol: [C:03+1] Declare the webrequest.dumps.dev0 stream in EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [14:11:54] !log UTC afternoon backport+config window done [14:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:59] (a few minutes ago) [14:12:11] cool cool. 
I was about to ping you [14:12:46] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:12:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [14:13:11] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295930 (https://phabricator.wikimedia.org/T421797) [14:15:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:16:05] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:17:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:17:51] (03CR) 10Ladsgroup: [C:03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295930 (https://phabricator.wikimedia.org/T421797) (owner: 10Ladsgroup) [14:17:58] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [14:19:19] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295930 (https://phabricator.wikimedia.org/T421797) (owner: 10Ladsgroup) [14:19:43] (03PS1) 10Jgiannelos: tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295932 [14:20:07] FIRING: [4x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [14:20:16] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Unrack old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427283#11972441 (10ayounsi) p:05Triage→03Low [14:20:41] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11972454 (10ayounsi) p:05Triage→03Medium [14:21:49] (03PS5) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) [14:21:49] (03PS4) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) [14:21:50] (03PS4) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) [14:21:52] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:22:05] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [14:22:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:23:12] (03PS19) 10Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) [14:23:22] (03CR) 10Ssingh: [C:03+2] Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [14:23:26] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11972468 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:23:34] !log sukhe@dns1004 START - running authdns-update [14:25:07] RESOLVED: [4x] ProbeDown: Service 
sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:25:15] !log sukhe@dns1004 END - running authdns-update [14:25:23] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:26:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:27:16] !log ladsgroup@deploy1003 Synchronized portals/wikipedia.org/assets: Deploy portals (T421797) (duration: 06m 10s) [14:27:19] T421797: Remove Wikinews from various multilingual portals - https://phabricator.wikimedia.org/T421797 [14:27:38] (03CR) 10Atsuko: [C:03+2] Cleanup old values for turnilo and eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295405 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:28:01] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:29:48] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [14:29:50] 06SRE, 06Infrastructure-Foundations, 07Puppet (Puppet 7.0): slow command processing when introducing lots of new hosts - https://phabricator.wikimedia.org/T341674#11972506 (10LSobanski) 05Open→03Declined This was a one-off problem and we've fully migrated to Puppet 7 now. Resolving. 
[14:30:00] !log ladsgroup@deploy1003 Synchronized portals: Deploy portals (T421797) (duration: 02m 43s) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1430) [14:30:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:30:08] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:30:22] FIRING: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [14:31:07] (03PS11) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [14:31:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:31:08] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:25] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [14:33:34] (03CR) 10JMeybohm: [C:03+1] docker_registry: replace rdb2009 with rdb2013 [puppet] - 10https://gerrit.wikimedia.org/r/1294279 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [14:34:02] 06SRE, 
06Infrastructure-Foundations, 10Puppet-Infrastructure, 07Puppet (Puppet 7.0): remove puppet::expose_agent_certs from puppetdb classes - https://phabricator.wikimedia.org/T341374#11972566 (10LSobanski) p:05Medium→03Low [14:34:16] (03CR) 10CI reject: [V:04-1] sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [14:34:17] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [14:34:30] (03CR) 10JMeybohm: [C:03+1] ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [14:35:22] RESOLVED: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:55] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Decom cookbook should only warn about unexpected matches in Puppet - https://phabricator.wikimedia.org/T297516#11972594 (10LSobanski) This looks resolved. @RLazarus please reopen if you think otherwise. 
[14:36:10] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:36:22] (03Merged) 10jenkins-bot: Cleanup old values for turnilo and eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295405 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:37:29] (03PS6) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) [14:37:29] (03PS5) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) [14:37:29] (03PS5) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) [14:37:45] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:37:51] 06SRE, 10Observability-Alerting, 10Puppet-Core, 13Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253#11972611 (10LSobanski) Untagging IF. [14:37:51] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:37:58] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [14:38:28] (03PS12) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [14:38:45] 06SRE, 10netops, 06Traffic-Icebox: experiment with reenabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288#11972619 (10LSobanski) Untagging IF. 
[14:39:44] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990#11972623 (10LSobanski) p:05Medium→03Low [14:40:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93440 and previous config saved to /var/cache/conftool/dbconfig/20260601-144002-fceratto.json [14:41:08] 06SRE, 06Infrastructure-Foundations, 10provisioning-automation: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875#11972624 (10LSobanski) [14:41:24] 06SRE, 05Cloud-Services-Origin-User, 07Cloud-Services-Worktype-Unplanned: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295#11972625 (10fnegri) p:05High→03Low a:05dcaro→03None > So this task is to remove any... [14:41:59] !log atsuko@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:42:10] FIRING: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:14] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [14:42:18] !log atsuko@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:43:19] (03PS3) 10CDanis: cache::haproxy: limit email addresses to reasonable lengths [puppet] - 10https://gerrit.wikimedia.org/r/1240174 (owner: 10Giuseppe Lavagetto) [14:43:21] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240174 (owner: 10Giuseppe Lavagetto) [14:44:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11972676 (10VRiley-WMF) 05In progress→03Resolved Thank you for the update! Closing this. Please let us know if anything else happens! [14:45:19] (03CR) 10CDanis: [C:03+2] cache::haproxy: limit email addresses to reasonable lengths [puppet] - 10https://gerrit.wikimedia.org/r/1240174 (owner: 10Giuseppe Lavagetto) [14:46:20] (03CR) 10Btullis: "I think that this is done." [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:47:10] RESOLVED: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:01] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:49:02] 10SRE-Access-Requests: Adding FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T427823 (10jasmine_) 03NEW [14:49:22] !log atsuko@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:49:29] (03CR) 10JMeybohm: "This looks pretty good, thanks! Two minor nits inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [14:49:52] !log atsuko@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:50:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93441 and previous config saved to /var/cache/conftool/dbconfig/20260601-145010-fceratto.json [14:50:15] (03PS7) 10Andrew Bogott: designate: remove leftover mcrouter code [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) [14:50:38] !log atsuko@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:50:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) (owner: 10Andrew Bogott) [14:51:05] !log atsuko@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:51:54] (03PS1) 10Federico Ceratto: db1224: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295940 (https://phabricator.wikimedia.org/T427535) [14:51:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:52:05] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum5003.eqsin.wmnet with OS trixie [14:52:20] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1209: Upgrading db1209.eqiad.wmnet [14:52:36] !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:52:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1209: Upgrading db1209.eqiad.wmnet [14:52:52] !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:53:07] FIRING: ProbeDown: Service sessionstore1005-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1005-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:04] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804) (owner: 10Majavah) [14:54:18] (03PS5) 10Atsuko: Cleanup eventstream-internal [puppet] - 10https://gerrit.wikimedia.org/r/1295406 (https://phabricator.wikimedia.org/T348763) [14:54:36] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge:legacy_redirector: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295913 (https://phabricator.wikimedia.org/T149804) (owner: 10Majavah) [14:54:57] (03CR) 10Muehlenhoff: [C:03+2] profile::mariadb::proxy: Use Puppet types [puppet] - 10https://gerrit.wikimedia.org/r/1294258 (owner: 10Muehlenhoff) [14:55:06] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1209.eqiad.wmnet with OS trixie [14:55:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:56:25] FIRING: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:31] (03CR) 10Muehlenhoff: [C:03+2] toolforge::elasticsearch::haproxy: Restrict to cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1295398 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [14:57:19] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11972753 (10MoritzMuehlenhoff) [14:57:33] (03PS1) 10Jasmine: admin: replacing spare FIDO backed key [puppet] - 10https://gerrit.wikimedia.org/r/1295941 
(https://phabricator.wikimedia.org/T427823) [14:58:07] RESOLVED: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:18] (03CR) 10Brouberol: [C:03+1] Cleanup eventstream-internal [puppet] - 10https://gerrit.wikimedia.org/r/1295406 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:59:53] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11972766 (10Dzahn) [15:00:15] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11972767 (10Dzahn) confirmed SSH key out of band [15:00:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93443 and previous config saved to /var/cache/conftool/dbconfig/20260601-150017-fceratto.json [15:03:12] (03CR) 10Brouberol: [C:04-1] "The ports collide with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295396 I think the ports assigned to the previous opensearch " [puppet] - 10https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:04:37] FIRING: [2x] ProbeDown: Service sessionstore1005-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:37] (03PS2) 10Jasmine: admin: replacing spare FIDO backed key [puppet] - 10https://gerrit.wikimedia.org/r/1295941 (https://phabricator.wikimedia.org/T427823) [15:06:25] FIRING: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:27] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:sessionstore [15:08:36] (03PS13) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [15:09:37] RESOLVED: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93445 and previous config saved to /var/cache/conftool/dbconfig/20260601-151024-fceratto.json [15:10:35] (03PS1) 10Muehlenhoff: autoinstall: Switch to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1295945 (https://phabricator.wikimedia.org/T416707) [15:10:46] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage [15:10:51] jouncebot: nowandnext [15:10:52] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [15:10:52] In 0 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1530) [15:10:53] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for VisualEditor on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295946 (https://phabricator.wikimedia.org/T425940) [15:11:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295946 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [15:12:09] 
(03Merged) 10jenkins-bot: hCaptcha: Enable for VisualEditor on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295946 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [15:12:16] (03PS1) 10Kamila Součková: CI: Fix CI pass on template render fail [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) [15:12:20] (03CR) 10Muehlenhoff: [C:03+2] autoinstall: Switch to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1295945 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [15:12:24] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1295946|hCaptcha: Enable for VisualEditor on all WMF wikis (T425940)]] [15:12:27] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [15:12:46] (03CR) 10CWilliams: [C:03+1] db1224: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295940 (https://phabricator.wikimedia.org/T427535) (owner: 10Federico Ceratto) [15:12:52] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:13:09] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:14:04] (03CR) 10Federico Ceratto: [C:03+2] db1224: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295940 (https://phabricator.wikimedia.org/T427535) (owner: 10Federico Ceratto) [15:14:09] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1295946|hCaptcha: Enable for VisualEditor on all WMF wikis (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:14:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage [15:14:43] (03PS1) 10Kamila Součková: .fixtures: remove erroneously committed file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295949 [15:16:37] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [15:17:57] I have a config patch to sync when Dreamy_Jazz is done [15:18:05] (03CR) 10Atsuko: [C:03+2] Cleanup eventstream-internal [puppet] - 10https://gerrit.wikimedia.org/r/1295406 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [15:18:21] (03PS1) 10Ottomata: mediawiki.user_change.dev0 - key by user.wiki_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198) [15:19:28] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:19:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1295941 (https://phabricator.wikimedia.org/T427823) (owner: 10Jasmine) [15:19:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:19:37] (03CR) 10Muehlenhoff: [C:03+2] admin: replacing spare FIDO backed key [puppet] - 10https://gerrit.wikimedia.org/r/1295941 (https://phabricator.wikimedia.org/T427823) (owner: 10Jasmine) [15:19:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:48] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295946|hCaptcha: Enable for VisualEditor on all WMF wikis (T425940)]] (duration: 08m 24s) [15:20:52] T425940: hCaptcha: Rollout 
of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [15:20:57] kostajh: Your turn [15:21:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295802 (owner: 10Kosta Harlan) [15:21:17] Dreamy_Jazz: thx [15:22:05] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:22:09] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:22:14] (03Merged) 10jenkins-bot: hCaptcha: Raise SiteVerify error threshold to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295802 (owner: 10Kosta Harlan) [15:22:18] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:22:23] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:22:29] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1295802|hCaptcha: Raise SiteVerify error threshold to 100]] [15:22:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2005.codfw.wmnet with OS bullseye [15:24:15] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1295802|hCaptcha: Raise SiteVerify error threshold to 100]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:24:35] !log kharlan@deploy1003 kharlan: Continuing with deployment [15:24:58] (03PS3) 10Kosta Harlan: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) [15:24:58] (03PS3) 10Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) [15:25:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling [15:25:43] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1224: Pooling [15:25:58] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1224.eqiad.wmnet [15:25:58] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1224.eqiad.wmnet [15:26:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling [15:26:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:26:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93446 and previous config saved to /var/cache/conftool/dbconfig/20260601-152638-fceratto.json [15:27:26] (03PS1) 10Dzahn: admin: upgrade Mahmoud Abdelsattar from ldap_only to shell user [puppet] - 10https://gerrit.wikimedia.org/r/1295952 (https://phabricator.wikimedia.org/T427597) [15:28:45] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295802|hCaptcha: Raise SiteVerify error threshold to 100]] (duration: 06m 15s) [15:29:42] (03PS2) 10Atsuko: service: services_proxy: prod opensearch-on-k8s services [puppet] - 10https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) [15:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1530). [15:31:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1209.eqiad.wmnet with OS trixie [15:33:20] (03CR) 10TChin: [C:03+1] mediawiki.user_change.dev0 - key by user.wiki_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198) (owner: 10Ottomata) [15:33:38] (03PS1) 10Eevans: linked-artifacts: deploy v1.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295954 (https://phabricator.wikimedia.org/T427508) [15:34:36] (03PS2) 10Atsuko: translate: adding separate read/write endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) [15:36:17] (03CR) 10Eevans: [C:03+2] linked-artifacts: deploy v1.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295954 (https://phabricator.wikimedia.org/T427508) (owner: 10Eevans) [15:37:46] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:37:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [15:38:44] (03Merged) 10jenkins-bot: linked-artifacts: deploy v1.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295954 (https://phabricator.wikimedia.org/T427508) (owner: 10Eevans) [15:38:47] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1224: Pooling [15:38:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling [15:39:09] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [15:39:12] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1209: Migration of db1209.eqiad.wmnet completed [15:39:21] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [15:39:40] !log eevans@deploy1003 helmfile [staging] 
DONE helmfile.d/services/linked-artifacts: apply [15:40:01] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1224.eqiad.wmnet [15:40:01] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1224.eqiad.wmnet [15:40:08] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1224.eqiad.wmnet [15:40:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1224.eqiad.wmnet [15:40:13] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1224: Pooling [15:40:16] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling [15:42:22] (03CR) 10JMeybohm: [C:03+1] .fixtures: remove erroneously committed file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295949 (owner: 10Kamila Součková) [15:42:50] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update outlink-topic-model docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [15:44:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [15:44:56] (03CR) 10JMeybohm: [C:04-1] CI: Fix CI pass on template render fail (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) (owner: 10Kamila Součková) [15:45:12] (03Merged) 10jenkins-bot: ml-services: Update outlink-topic-model docker image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1295899 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [15:45:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1224: Pooling [15:45:28] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973017 (10ops-monitoring-bot) Starting pool of db1224 by fceratto@cumin1003: Pooling [15:45:31] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973018 (10ops-monitoring-bot) Completed pooling of db1224 by fceratto@cumin1003: Pooling [15:48:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [15:49:08] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling [15:49:14] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1224: Pooling [15:49:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973047 (10ops-monitoring-bot) Starting pool of db1224 by fceratto@cumin1003: Pooling [15:49:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973048 (10ops-monitoring-bot) Completed pooling of db1224 by fceratto@cumin1003: Pooling [15:50:35] sukhe@cumin1003 reimage (PID 3686757) is awaiting input [15:51:08] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum5003.eqsin.wmnet with OS trixie [15:53:05] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973050 (10VRiley-WMF) Hey @MatthewVernon thanks for the response on the other ticket. I know tuesdays get a bit meeting heavy for mys... 
[15:53:26] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1224: Pooling [15:56:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2005.codfw.wmnet with OS bullseye [15:56:46] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.depool (exit_code=97) depool db1224: Pooling [15:56:55] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Pooling [15:57:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973082 (10ops-monitoring-bot) Starting pool of db1224 by fceratto@cumin1003: Pooling [16:00:34] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973097 (10MatthewVernon) Hi @VRiley-WMF I have oddly-full afternoons on other days at the moment; I could do 14:30-16:30 UTC on Wedne... [16:01:18] (03CR) 10Atsuko: "Updated the ports to 65xx, checked that there's no collisions." [puppet] - 10https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [16:01:35] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:01:36] (03PS1) 10Muehlenhoff: autoinstall: Switch to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1295956 (https://phabricator.wikimedia.org/T416707) [16:01:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973102 (10VRiley-WMF) @MatthewVernon Wednesday the 3rd absolutely works for me! 
we can start then [16:02:30] !log temporarily remove ganeti2027 from the codfw cluster T427357 [16:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:33] T427357: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357 [16:03:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet [16:04:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet [16:04:57] PROBLEM - ganeti-noded running on ganeti2027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:04:57] PROBLEM - ganeti-confd running on ganeti2027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:05:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd [16:05:50] FIRING: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:01] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[16:07:57] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:56] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:09:18] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1236: Update [16:09:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1236: Update [16:10:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1236.eqiad.wmnet with reason: Kernel update T426633 [16:10:41] (03PS1) 10Bartosz Wójtowicz: ml-services: Bump llm ns memory quota to 256Gi. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295958 [16:24:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:42] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1209: Migration of db1209.eqiad.wmnet completed [16:24:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [16:25:41] (03CR) 10JHathaway: [C:03+1] role::pki::multirootca: remove the Kafka kafka_11 profile [puppet] - 10https://gerrit.wikimedia.org/r/1295023 (owner: 10Elukey) [16:25:47] PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to 
https://phabricator.wikimedia.org/P93455 and previous config saved to /var/cache/conftool/dbconfig/20260601-162653-fceratto.json [16:27:23] (03PS1) 10Marco Fossati: Carousel: Defer to MobileFrontend lightbox on mobile [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) [16:29:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd [16:29:27] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1236: Update [16:29:39] RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.78 ms [16:29:53] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1236: Update [16:30:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: 10Marco Fossati) [16:30:44] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1236: Update [16:30:54] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1236.eqiad.wmnet [16:30:55] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1236.eqiad.wmnet [16:31:06] (03CR) 10Marco Fossati: [C:03+1] Carousel: Defer to MobileFrontend lightbox on mobile [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: 10Marco Fossati) [16:31:13] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:31:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1236.eqiad.wmnet with reason: Kernel update T426633 [16:34:13] (03CR) 10JHathaway: 
sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) (owner: 10Muehlenhoff) [16:34:19] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:34:21] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:34:27] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1236: Update [16:34:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1236: Update [16:35:03] !log ryankemper@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,cluster=wdqs,service=wdqs-main,name=wdqs1015.eqiad.wmnet [16:35:30] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1236.eqiad.wmnet [16:35:31] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1236.eqiad.wmnet [16:35:55] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:40] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:37:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93458 and previous config saved to /var/cache/conftool/dbconfig/20260601-163701-fceratto.json [16:37:12] !log ryankemper@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,cluster=wdqs-main,service=wdqs-main,name=wdqs1015.eqiad.wmnet [16:41:59] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [16:42:24] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool 
db1224: Pooling [16:42:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11973208 (10ops-monitoring-bot) Completed pooling of db1224 by fceratto@cumin1003: Pooling [16:47:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93460 and previous config saved to /var/cache/conftool/dbconfig/20260601-164709-fceratto.json [16:47:16] (03PS3) 10Dzahn: gerrit: use stunnel with rsync of lfs data [puppet] - 10https://gerrit.wikimedia.org/r/1295500 (https://phabricator.wikimedia.org/T412780) [16:47:21] 06SRE, 10SRE-Access-Requests: Adding FIDO backed key [Jasmine] - https://phabricator.wikimedia.org/T427823#11973222 (10jasmine_) 05Open→03Resolved a:03jasmine_ [16:47:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11973231 (10MatthewVernon) Cool, I've blocked that out in my calendar :) [16:50:50] RESOLVED: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912#11973277 (10BCornwall) 05Open→03Resolved That's a fair point, and considering we're on nvme drives power loss is less of a concern as well since it's non-volatile.... [16:51:25] (03CR) 10Brouberol: [C:03+1] "Nice and thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [16:51:49] (03CR) 10Jasmine: [C:03+2] kafka-main2006: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288917 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [16:52:06] (03CR) 10Atsuko: [C:03+2] service: services_proxy: prod opensearch-on-k8s services [puppet] - 10https://gerrit.wikimedia.org/r/1295901 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [16:52:54] (03CR) 10Daniel Kinzler: Rakefile: Run chart specific tests (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [16:53:32] (03PS1) 10Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1295967 [16:54:10] (03CR) 10CI reject: [V:04-1] gerrit: use rsync::quickdatacopy in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1295967 (owner: 10Dzahn) [16:56:15] (03PS1) 10Marco Fossati: MultimediaViewer: enable image carousel as a beta feature on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295968 (https://phabricator.wikimedia.org/T426799) [16:57:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93462 and previous config saved to /var/cache/conftool/dbconfig/20260601-165717-fceratto.json [16:57:30] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS trixie [16:57:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295968 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [16:58:13] !log drop flaggedrevs tables 
on wikinews wikis (T423577) [16:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:16] T423577: Undeploy FlaggedRevs from Wikinews and drop FlaggedRevs tables - https://phabricator.wikimedia.org/T423577 [16:59:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd2001.codfw.wmnet to drbd [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1700) [17:00:04] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1700). [17:01:35] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:03:37] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1180: Pooling [17:03:41] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1180: Pooling [17:03:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1180: Pooling [17:04:01] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1180: Pooling [17:04:08] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1180: Pooling [17:04:10] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1180: Pooling [17:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:10:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - 
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [17:10:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd2001.codfw.wmnet to drbd [17:10:13] PROBLEM - Host dse-k8s-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:55] RECOVERY - Host dse-k8s-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.80 ms [17:16:29] (03CR) 10Majavah: [C:04-1] "This needs more context/a task attached (why are we building a system to track individual users), but also should not be tied to individua" [puppet] - 10https://gerrit.wikimedia.org/r/1294864 (owner: 10Komla Sapaty) [17:20:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1236: Update [17:20:39] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T427553#11973388 (10Raine) [17:22:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T427553#11973401 (10Raine) @Milimetric @Ahoelzl @Ottomata can one of you please approve? Thanks! [17:28:40] (03PS1) 10Chlod Alejandro: nlwiki: change to Wikipedia 25 logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295976 (https://phabricator.wikimedia.org/T424519) [17:29:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T427553#11973446 (10Raine) >>! In T427553#11973400, @Raine wrote: > @Milimetric @Ahoelzl @Ottomata can one of you please approve? Thanks! Apologies, I hadn't realised this... 
[17:31:23] jouncebot: nowandnext [17:31:23] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T1700) [17:31:24] In 2 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2000) [17:32:10] chlod: o/ [17:32:14] \o/ [17:32:25] I am going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1295976, a logos change [17:32:36] (03PS1) 10Audrey Penven: Update config for WikiProjects linking prototype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) [17:33:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T427553#11973471 (10Milimetric) approved [17:33:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295976 (https://phabricator.wikimedia.org/T424519) (owner: 10Chlod Alejandro) [17:33:45] (03PS1) 10Kamila Součková: admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) [17:34:16] (03Merged) 10jenkins-bot: nlwiki: change to Wikipedia 25 logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295976 (https://phabricator.wikimedia.org/T424519) (owner: 10Chlod Alejandro) [17:34:31] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1295976|nlwiki: change to Wikipedia 25 logo (T424519)]] [17:34:33] (03CR) 10CI reject: [V:04-1] admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) (owner: 10Kamila Součková) [17:34:35] T424519: Per community Rfc, for the month of June please change to the birthday logo for nl.wikipedia.org - https://phabricator.wikimedia.org/T424519 [17:36:25] !log samtar@deploy1003 chlod, samtar: Backport 
for [[gerrit:1295976|nlwiki: change to Wikipedia 25 logo (T424519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:36:34] checking [17:37:31] looks good :) [17:37:45] !log samtar@deploy1003 chlod, samtar: Continuing with deployment [17:39:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [17:39:44] Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ... [17:39:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:42:01] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295976|nlwiki: change to Wikipedia 25 logo (T424519)]] (duration: 07m 29s) [17:42:04] T424519: Per community Rfc, for the month of June please change to the birthday logo for nl.wikipedia.org - https://phabricator.wikimedia.org/T424519 [17:42:19] lgtm on prod now [17:42:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:57] likewise, thank you TheresNoTIme! [17:43:05] np! 
[17:44:28] 06SRE, 06Traffic: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836 (10ssingh) 03NEW [17:44:30] 06SRE, 06Traffic: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#11973510 (10ssingh) p:05Triage→03Medium [17:53:03] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS trixie [17:58:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet [17:59:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet [18:01:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [18:01:32] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:01:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [18:02:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd2001.codfw.wmnet to plain [18:02:46] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:03:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd2001.codfw.wmnet to plain [18:03:17] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1117.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:04:17] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:04:19] (03CR) 10Ottomata: "Ahhh! Got it! Great, so this produces to event platform streams, then logstash just consumes them. Okay!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [18:05:07] (03CR) 10Ottomata: [C:03+1] Configure nginx to log requests in ECS format to syslog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [18:05:11] (03CR) 10Ottomata: [C:03+1] Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [18:05:20] (03CR) 10Ottomata: [C:03+1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [18:05:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet [18:06:07] (03CR) 10Ottomata: [C:03+1] Declare the webrequest.dumps.dev0 stream in EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [18:06:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet [18:14:39] (03CR) 10Ottomata: flink-app - default to setting metrics.internal.query-service.port (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [18:14:42] (03PS3) 10Ottomata: flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) [18:14:45] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [18:14:51] (03CR) 10Ottomata: [C:03+2] flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [18:16:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198) (owner: 10Ottomata) [18:16:39] (03PS1) 10Jdlrobson: styles: Hide donor badge container by default [skins/MinervaNeue] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295990 (https://phabricator.wikimedia.org/T425450) [18:17:25] (03Merged) 10jenkins-bot: mediawiki.user_change.dev0 - key by user.wiki_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295950 (https://phabricator.wikimedia.org/T426198) (owner: 10Ottomata) [18:17:38] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1295950|mediawiki.user_change.dev0 - key by user.wiki_id (T426198)]] [18:17:41] T426198: Event schemas - mediawiki user entity should be wiki aware - https://phabricator.wikimedia.org/T426198 [18:17:49] (03Merged) 10jenkins-bot: flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [18:19:28] !log otto@deploy1003 otto: Backport for [[gerrit:1295950|mediawiki.user_change.dev0 - key by user.wiki_id (T426198)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:20:05] !log otto@deploy1003 otto: Continuing with deployment [18:23:04] (03CR) 10Volans: [C:04-1] "The idea of the patch is fine, it's a nice addition and I can see when it could be useful." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/1294990 (owner: 10CDanis) [18:24:20] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295950|mediawiki.user_change.dev0 - key by user.wiki_id (T426198)]] (duration: 06m 42s) [18:24:23] T426198: Event schemas - mediawiki user entity should be wiki aware - https://phabricator.wikimedia.org/T426198 [18:24:37] (03Abandoned) 10Dr0ptp4kt: Reactivate wikimedia.de email addresses for GrowthBook SSO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294372 (https://phabricator.wikimedia.org/T418665) (owner: 10Dr0ptp4kt) [18:29:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [18:29:44] Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ... [18:29:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:38:10] (03PS1) 10JHathaway: mx: honor reject policy for DMARC [puppet] - 10https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884) [18:38:25] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [18:44:45] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1002 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [18:45:21] EventStreams is flapping. Container OOMs not sure why. 
[18:45:21] https://phabricator.wikimedia.org/T427839 [18:49:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [18:51:40] (03CR) 10JHathaway: [C:03+2] mx: honor reject policy for DMARC [puppet] - 10https://gerrit.wikimedia.org/r/1295992 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [18:53:40] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:57:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/4 (Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:00:36] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: sync [19:01:05] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync [19:01:56] ottomata: working on kafka-main trixie upgrade, (tested on a single host (kafka-main2006) and failed, currently looking through logs) perhaps might it be related? 
[19:02:15] re: EventStreams^ [19:03:46] ah nvm looks like it's been flapping from before the reimage [19:19:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:23:13] (03CR) 10Dzahn: [C:03+2] gerrit: use stunnel with rsync of lfs data [puppet] - 10https://gerrit.wikimedia.org/r/1295500 (https://phabricator.wikimedia.org/T412780) (owner: 10Dzahn) [19:28:05] (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.24.0-a7 [vendor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296002 (https://phabricator.wikimedia.org/T353697) [19:29:00] (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) [19:30:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: 10Arlolra) [19:34:55] jasmine_: yeah since last wednesday, thanks for checking though [19:36:33] its on all pods, and looks present in codfw, although less quickly since there are fewer connections there. [19:36:36] i'm going to try to rever [19:36:36] t [19:36:53] to what we had before last wed. 
no idea why this would be happening though [19:40:05] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1355757424 and 106 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:40:23] (03PS1) 10Ottomata: eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296004 (https://phabricator.wikimedia.org/T427839) [19:42:05] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 52216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:42:38] (03CR) 10Ottomata: [C:03+2] eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296004 (https://phabricator.wikimedia.org/T427839) (owner: 10Ottomata) [19:43:40] (03CR) 10Btullis: [C:03+1] eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296004 (https://phabricator.wikimedia.org/T427839) (owner: 10Ottomata) [19:45:59] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [19:46:09] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [19:46:56] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [19:47:09] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [19:47:57] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [19:48:29] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [19:54:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:57] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2000) [20:00:05] sfaci, RoanKattouw, xxb, jdlrobson, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:37] I can deploy [20:01:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295504 (https://phabricator.wikimedia.org/T427562) (owner: 10Catrope) [20:01:41] hii [20:01:49] i'm here for Santi's patch - it can ride along with other config patches [20:03:26] o/ [20:05:00] (03Merged) 10jenkins-bot: passwordlessLogin: Don't immediately error out in unsupported browsers [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295504 (https://phabricator.wikimedia.org/T427562) (owner: 10Catrope) [20:05:16] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1295504|passwordlessLogin: Don't immediately error out in unsupported browsers (T427562)]] [20:05:19] T427562: Users without passkeys and without passkey support in their browser cannot login - https://phabricator.wikimedia.org/T427562 [20:06:12] (03PS1) 10Catrope: Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] 
(wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) [20:07:00] !log catrope@deploy1003 catrope: Backport for [[gerrit:1295504|passwordlessLogin: Don't immediately error out in unsupported browsers (T427562)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:49] (03PS2) 10Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1295967 [20:08:42] !log catrope@deploy1003 catrope: Continuing with deployment [20:08:43] (03CR) 10CI reject: [V:04-1] gerrit: use rsync::quickdatacopy in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1295967 (owner: 10Dzahn) [20:09:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [20:12:53] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295504|passwordlessLogin: Don't immediately error out in unsupported browsers (T427562)]] (duration: 07m 37s) [20:12:57] T427562: Users without passkeys and without passkey support in their browser cannot login - https://phabricator.wikimedia.org/T427562 [20:13:08] (03PS3) 10Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) [20:13:23] Next I'll do Santi's patch (cc cjming) together with xxb's patch [20:13:35] thanks Roan!! 
[20:14:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) (owner: 10Santiago Faci) [20:14:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295531 (https://phabricator.wikimedia.org/T427384) (owner: 10XXBlackburnXx) [20:15:19] (03CR) 10CI reject: [V:04-1] gerrit: use rsync::quickdatacopy in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: 10Dzahn) [20:18:12] (03Merged) 10jenkins-bot: Remove `wgTestKitchenExperimentStreamNames` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) (owner: 10Santiago Faci) [20:18:16] (03Merged) 10jenkins-bot: Enable AbuseFilter block action on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295531 (https://phabricator.wikimedia.org/T427384) (owner: 10XXBlackburnXx) [20:18:30] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1285412|Remove `wgTestKitchenExperimentStreamNames` (T422358)]], [[gerrit:1295531|Enable AbuseFilter block action on nlwiki (T427384)]] [20:18:35] T422358: Deprecate and remove Experiment#setStream(streamName) - https://phabricator.wikimedia.org/T422358 [20:18:35] T427384: Enable Abusefilter "block" consequence on nlwiki - https://phabricator.wikimedia.org/T427384 [20:20:15] !log catrope@deploy1003 sfaci, xxblackburnxx, catrope: Backport for [[gerrit:1285412|Remove `wgTestKitchenExperimentStreamNames` (T422358)]], [[gerrit:1295531|Enable AbuseFilter block action on nlwiki (T427384)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:20:58] cjming, xxb: Please test your changes (or tell me they're not really testable, they look like they might not be) [20:21:30] mine's a no-op [20:21:50] RoanKattouw: looks good on my end [20:21:54] thanks :) [20:22:09] !log catrope@deploy1003 sfaci, xxblackburnxx, catrope: Continuing with deployment [20:24:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:26:18] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285412|Remove `wgTestKitchenExperimentStreamNames` (T422358)]], [[gerrit:1295531|Enable AbuseFilter block action on nlwiki (T427384)]] (duration: 07m 48s) [20:26:23] T422358: Deprecate and remove Experiment#setStream(streamName) - https://phabricator.wikimedia.org/T422358 [20:26:23] T427384: Enable Abusefilter "block" consequence on nlwiki - https://phabricator.wikimedia.org/T427384 [20:28:43] Jdlrobson: you around? 
[20:29:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296002 (https://phabricator.wikimedia.org/T353697) (owner: 10Arlolra) [20:29:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: 10Arlolra) [20:29:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [20:34:34] (03PS1) 10Arlolra: Deploy PRV to 5 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851) [20:36:59] 10ops-eqiad, 06DC-Ops: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852 (10RKemper) 03NEW [20:37:02] (03PS1) 10Ottomata: Revert "eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296016 [20:37:40] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T427852 hw failure [20:37:45] T427852: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852 [20:41:53] (03CR) 10Atsuko: [C:03+1] Revert "eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296016 (owner: 10Ottomata) [20:43:27] (03PS1) 10Ottomata: eventstreams - increase memory to 2.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296018 (https://phabricator.wikimedia.org/T427839) [20:43:34] (03CR) 10Ottomata: [C:03+2] Revert "eventstreams - revert to helmfile at 
6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296016 (owner: 10Ottomata) [20:45:15] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a7 [vendor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296002 (https://phabricator.wikimedia.org/T353697) (owner: 10Arlolra) [20:45:20] (03CR) 10CI reject: [V:04-1] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [20:45:35] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: 10Arlolra) [20:45:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:45:54] (03Merged) 10jenkins-bot: Revert "eventstreams - revert to helmfile at 6020329a1f1dbd9c9625cd9c97289d44e4b8271e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296016 (owner: 10Ottomata) [20:46:22] !incidents [20:46:22] 8038 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [20:46:22] 8037 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [20:46:23] 8036 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [20:46:23] 8035 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) 
[20:47:24] (03CR) 10Catrope: [C:03+2] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [20:47:31] (03CR) 10Catrope: [C:03+2] Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: 10Arlolra) [20:47:38] Looks like a CI issue, retrying [20:49:29] (03CR) 10Ottomata: [C:03+2] eventstreams - increase memory to 2.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296018 (https://phabricator.wikimedia.org/T427839) (owner: 10Ottomata) [20:50:20] (03CR) 10CI reject: [V:04-1] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [20:51:17] ugh now that one is failing because of the Parsoid version mismatch. 
I'll retry it after the second Parsoid patch lands [20:51:43] :| [20:51:46] sorry about that [20:51:49] (03Merged) 10jenkins-bot: eventstreams - increase memory to 2.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296018 (https://phabricator.wikimedia.org/T427839) (owner: 10Ottomata) [20:52:32] (03CR) 10Cwhite: Configure nginx to log requests in ECS format to syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [20:53:40] !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [20:54:43] (03PS1) 10Ottomata: Revert "eventstreams - increase memory to 2.5Gi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296019 [20:54:52] (03CR) 10Ottomata: [C:03+2] Revert "eventstreams - increase memory to 2.5Gi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296019 (owner: 10Ottomata) [20:55:46] No that's OK, it's CI's fault for randomly failing [20:55:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:55:55] And then I resubmitted the patches in the wrong order [20:56:18] (03CR) 10Catrope: Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [20:56:21] (03CR) 10Catrope: [C:03+2] Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [20:57:04] (03Merged) 10jenkins-bot: Revert "eventstreams - 
increase memory to 2.5Gi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296019 (owner: 10Ottomata) [21:00:05] alexsanford, Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2100). [21:00:21] (03CR) 10Cwhite: [C:03+1] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [21:00:40] (03CR) 10Cwhite: [C:03+1] "+1 - Errors in this config can cause logging to stop flowing completely." [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [21:01:49] (03PS1) 10Atsuko: eventstreams: set envoy timeout to 0s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) [21:03:04] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a7 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296003 (https://phabricator.wikimedia.org/T427565) (owner: 10Arlolra) [21:03:08] (03Merged) 10jenkins-bot: Redirect Special:AccountRecovery to the shared domain [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296009 (https://phabricator.wikimedia.org/T427692) (owner: 10Catrope) [21:03:45] (03CR) 10Ottomata: [C:03+1] eventstreams: set envoy timeout to 0s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) (owner: 10Atsuko) [21:04:20] preparing to deploy a few security patches [21:04:21] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1296002|Bump wikimedia/parsoid to 0.24.0-a7 (T353697 T415591 T427565)]], [[gerrit:1296003|Bump wikimedia/parsoid to 0.24.0-a7 (T427565)]], [[gerrit:1296009|Redirect Special:AccountRecovery to the shared domain (T427692)]] [21:04:23] maryum: 
I'm still deploying, sorry [21:04:26] no worries [21:04:31] just let me know [21:04:34] T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697 [21:04:34] T415591: Template source used in shadow attribute - https://phabricator.wikimedia.org/T415591 [21:04:35] CI was being difficult, sorry for going over time [21:04:35] T427565: CTT tasks week of 2026-05-29 - https://phabricator.wikimedia.org/T427565 [21:04:36] T427692: Special:AccountRecovery never allows itself to be used - https://phabricator.wikimedia.org/T427692 [21:05:29] (03PS1) 10Jdlrobson: Donor Delight Badge: Add dependency on mw.user [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296022 (https://phabricator.wikimedia.org/T427850) [21:05:37] After my deploy maryum should do the security deploys, and then when that's done I can deploy Jdlrobson's patches if he's available by then [21:05:51] sounds good thanks [21:06:07] !log catrope@deploy1003 catrope, arlolra: Backport for [[gerrit:1296002|Bump wikimedia/parsoid to 0.24.0-a7 (T353697 T415591 T427565)]], [[gerrit:1296003|Bump wikimedia/parsoid to 0.24.0-a7 (T427565)]], [[gerrit:1296009|Redirect Special:AccountRecovery to the shared domain (T427692)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:06:16] We have two reg security patches today, and then one change to PS.php that needs to go out... 
[21:07:05] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 253866936 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:07:46] !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [21:07:58] !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [21:08:22] !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [21:09:05] I tested my patch and it works [21:09:05] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2870200 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:09:10] arlolra: Let me know when you're done testing [21:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:09:15] done, lgtm [21:09:18] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [21:09:26] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [21:09:32] RoanKattouw: ^ [21:09:33] !log catrope@deploy1003 catrope, arlolra: Continuing with deployment [21:10:18] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [21:12:46] (03PS2) 10Atsuko: eventstreams: set envoy timeout to 0s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) [21:13:41] !log 
catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296002|Bump wikimedia/parsoid to 0.24.0-a7 (T353697 T415591 T427565)]], [[gerrit:1296003|Bump wikimedia/parsoid to 0.24.0-a7 (T427565)]], [[gerrit:1296009|Redirect Special:AccountRecovery to the shared domain (T427692)]] (duration: 09m 20s) [21:13:48] T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697 [21:13:49] T415591: Template source used in shadow attribute - https://phabricator.wikimedia.org/T415591 [21:13:49] T427565: CTT tasks week of 2026-05-29 - https://phabricator.wikimedia.org/T427565 [21:13:49] T427692: Special:AccountRecovery never allows itself to be used - https://phabricator.wikimedia.org/T427692 [21:14:01] (03CR) 10Atsuko: [C:03+2] eventstreams: set envoy timeout to 0s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) (owner: 10Atsuko) [21:14:12] maryum: All yours, please ping me when you're done [21:14:40] yayyyy [21:16:17] (03PS1) 10Reedy: Add maintenance script to scrape SVG render files [extensions/timeline] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296024 [21:16:24] (03Merged) 10jenkins-bot: eventstreams: set envoy timeout to 0s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296021 (https://phabricator.wikimedia.org/T427839) (owner: 10Atsuko) [21:16:41] RoanKattouw: thanks! 
[21:21:05] running scap for the first security patch
[21:21:10] one of two patches to be deployed
[21:21:28] then sbassett will do a PS.php deploy after that
[21:26:13] (PS1) Jdlrobson: styles: Limit selector to badge client pref [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407)
[21:27:22] first scap done preparing to run second scap
[21:27:32] !log Deployed security fix for T427235
[21:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:15] (PS1) Zabe: maintain-views: Loosen views for filerevision table [puppet] - https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804)
[21:32:59] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[21:33:08] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[21:34:36] (CR) Eevans: [C:+2] Add component/cassandra50 for Cassandra 5.0.x releases [puppet] - https://gerrit.wikimedia.org/r/1287923 (https://phabricator.wikimedia.org/T418419) (owner: Eevans)
[21:35:16] !log atsuko@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[21:35:36] !log Deployed security fix for T427611
[21:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:45] sbassett go ahead with PS.php
[21:35:55] !log atsuko@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[21:36:42] (CR) Zabe: "Yeah we should the task first, but I would still do this prior to stop writing to production since otherwise folks will assume the tables " [puppet] - https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191) (owner: Zabe)
[21:42:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:45:38] Deploying update to PS.php now…
[21:50:43] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[21:51:46] !log Deployed updated mitigation for T326691
[21:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:17] ottomata and atsukoito: thank you
[21:54:23] JJMC89: my pleasure
[21:56:39] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[21:58:23] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[21:59:10] (PS1) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[21:59:38] (PS2) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[22:00:56] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[22:06:03] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[22:06:26] (CR) CI reject: [V:-1] Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[22:06:49] Ok, we should be done with security deployments for today.
[22:07:45] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[22:11:30] (PS3) Zabe: maintain-views: Drop image and oldimage tables [puppet] - https://gerrit.wikimedia.org/r/1281756 (https://phabricator.wikimedia.org/T425191)
[22:19:18] (CR) VolkerE: [C:+1] styles: Limit selector to badge client pref [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407) (owner: Jdlrobson)
[22:28:17] (CR) Reedy: [C:+2] Add maintenance script to scrape SVG render files [extensions/timeline] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296024 (owner: Reedy)
[22:29:18] (Merged) jenkins-bot: Add maintenance script to scrape SVG render files [extensions/timeline] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296024 (owner: Reedy)
[22:30:08] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1296024|Add maintenance script to scrape SVG render files]]
[22:31:54] !log reedy@deploy1003 reedy: Backport for [[gerrit:1296024|Add maintenance script to scrape SVG render files]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:32:16] !log reedy@deploy1003 reedy: Continuing with deployment
[22:36:31] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296024|Add maintenance script to scrape SVG render files]] (duration: 06m 22s)
[22:38:30] (CR) Ladsgroup: [C:+1] "I compare it and it looks correct based on what's on oldimage. I'll deploy it tomorrow."
[puppet] - https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804) (owner: Zabe)
[22:45:07] (PS4) Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780)
[22:53:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:54:36] (PS8) Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112)
[22:56:18] (CR) Aleksandar Mastilovic: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[22:57:52] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/4 (Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:58:15] (CR) Aleksandar Mastilovic: "I've added hiera values for the test cluster, too."
[puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[22:58:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:58:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:58:40] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:59:58] o/ will need to do some deploys in web team deploy window. Please let me know soonish if there is good reason not to.
[23:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260601T2300)
[23:00:13] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1295967/8626/gerrit1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: Dzahn)
[23:00:46] Jdlrobson: nothing from the SRE side, have a good deploy
[23:00:46] (PS3) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:01:00] (PS4) Jdlrobson: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:01:06] thanks rzl
[23:02:13] (PS5) Dzahn: gerrit: use rsync::quickdatacopy in migration class [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780)
[23:02:24] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296022 (https://phabricator.wikimedia.org/T427850) (owner: Jdlrobson)
[23:02:24] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407) (owner: Jdlrobson)
[23:04:38] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6015.*
[23:05:31] ops-drmrs, DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11974498 (BCornwall) Open→Resolved Checked again after a weekend and things seem fine.
Repooling and will check on it again to make double-sure but we should be good. Thanks!
[23:05:36] (Merged) jenkins-bot: Donor Delight Badge: Add dependency on mw.user [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296022 (https://phabricator.wikimedia.org/T427850) (owner: Jdlrobson)
[23:05:38] (Merged) jenkins-bot: styles: Limit selector to badge client pref [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296028 (https://phabricator.wikimedia.org/T427407) (owner: Jdlrobson)
[23:05:59] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1296022|Donor Delight Badge: Add dependency on mw.user (T427850)]], [[gerrit:1296028|styles: Limit selector to badge client pref (T427407)]]
[23:06:04] T427850: TypeError: Cannot read properties of undefined (reading 'set') - https://phabricator.wikimedia.org/T427850
[23:06:05] T427407: Search icon appears on the left on mobile while logged out on certain wikis - https://phabricator.wikimedia.org/T427407
[23:07:24] (PS1) Bvibber: Update name and address for bvibber, drop dead blog from planet [puppet] - https://gerrit.wikimedia.org/r/1296038
[23:07:43] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1296022|Donor Delight Badge: Add dependency on mw.user (T427850)]], [[gerrit:1296028|styles: Limit selector to badge client pref (T427407)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:07:47] PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[23:10:41] (CR) RLazarus: [C:+1] "Happy to +2 and deploy this, just LMK if you're ready."
[puppet] - https://gerrit.wikimedia.org/r/1296038 (owner: Bvibber)
[23:11:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:11:22] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[23:13:16] (CR) Dzahn: "the goal is for this to be identical to before, removing the custom rsync server config and the "if active_host" around it. the quickdatac" [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: Dzahn)
[23:14:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:14:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:14] (CR) Dzahn: [C:+1] Update name and address for bvibber, drop dead blog from planet [puppet] - https://gerrit.wikimedia.org/r/1296038 (owner: Bvibber)
[23:15:32] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296022|Donor Delight Badge: Add dependency on mw.user (T427850)]], [[gerrit:1296028|styles: Limit selector to badge client pref (T427407)]] (duration: 09m 33s)
[23:15:38] T427850: TypeError: Cannot read properties of undefined (reading 'set') - https://phabricator.wikimedia.org/T427850
[23:15:38] T427407: Search icon appears on the left on mobile while logged out on certain wikis - https://phabricator.wikimedia.org/T427407
[23:16:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:16:54] beginning
next set of changes
[23:17:09] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:17:10] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: Marco Fossati)
[23:19:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:19:46] (Abandoned) Jdlrobson: styles: Hide donor badge container by default [skins/MinervaNeue] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295990 (https://phabricator.wikimedia.org/T425450) (owner: Jdlrobson)
[23:19:56] (Merged) jenkins-bot: Make MultimediaViewer compatible with MobileFrontend legacy parser [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295963 (https://phabricator.wikimedia.org/T427542) (owner: Marco Fossati)
[23:20:02] (Merged) jenkins-bot: Carousel: Defer to MobileFrontend lightbox on mobile [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1295962 (https://phabricator.wikimedia.org/T427679) (owner: Marco Fossati)
[23:20:20] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1295963|Make MultimediaViewer compatible with MobileFrontend legacy parser (T427542)]], [[gerrit:1295962|Carousel: Defer to MobileFrontend lightbox on mobile (T427679)]]
[23:20:24] T427542: [Image Browsing] Carousel: MMV fails to load when clicking on carousel items that correspond to lazy image
placeholders (legacy parser only) - https://phabricator.wikimedia.org/T427542
[23:20:25] T427679: [Image Browsing] Carousel: Users should not see desktop MMV experience when clicking an image - https://phabricator.wikimedia.org/T427679
[23:21:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:22:03] !log jdlrobson@deploy1003 mfossati, jdlrobson: Backport for [[gerrit:1295963|Make MultimediaViewer compatible with MobileFrontend legacy parser (T427542)]], [[gerrit:1295962|Carousel: Defer to MobileFrontend lightbox on mobile (T427679)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:23:28] !log jdlrobson@deploy1003 mfossati, jdlrobson: Continuing with deployment
[23:25:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:26:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:27:37] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295963|Make MultimediaViewer compatible with MobileFrontend legacy parser (T427542)]], [[gerrit:1295962|Carousel: Defer to MobileFrontend lightbox on mobile (T427679)]] (duration: 07m 17s)
[23:27:41] T427542: [Image Browsing] Carousel: MMV fails to load when clicking on carousel items that correspond to lazy image placeholders (legacy parser only) - https://phabricator.wikimedia.org/T427542
[23:27:42] T427679: [Image Browsing] Carousel: Users should not see desktop MMV experience when clicking an image - https://phabricator.wikimedia.org/T427679
[23:29:53] (CR) BCornwall: [C:+1] "Code looks good but I'd like @joe to verify that this is the way we want to handle it."
[puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: Slyngshede)
[23:30:23] ops-eqiad, SRE, DC-Ops: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852#11974620 (Jclark-ctr) This server is out of warranty @rkemper. but I am looking at it right now
[23:31:44] (CR) BCornwall: "resetting to 0 as I can't find docs on `X-Image-Generator`" [puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: Slyngshede)
[23:34:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:35:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:39:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296045
[23:39:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296045 (owner: TrainBranchBot)
[23:40:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:40:35] (done)
[23:41:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:42:59] ops-eqiad, SRE, DC-Ops, Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11974646 (colewhite)
[23:44:20] (PS1) Scott French: scap.cfg.erb: Temporarily pin mediawiki_runtime_image [puppet] - https://gerrit.wikimedia.org/r/1296036 (https://phabricator.wikimedia.org/T418200)
[23:44:20] (CR) Scott French: "Apparently, I completely forgot that this was statically defined in [0] rather than somehow following what we do in MediaWiki image builds" [puppet] - https://gerrit.wikimedia.org/r/1296036 (https://phabricator.wikimedia.org/T418200) (owner: Scott French)
[23:52:29] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296045 (owner: TrainBranchBot)