[00:00:55] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:25] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1290096
[00:04:28] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1290097
[00:04:32] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1290098
[00:05:55] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:53] (PS1) Krinkle: mediawiki: Disable legacy `short_urls` on vhosts where it does not work [puppet] - https://gerrit.wikimedia.org/r/1290104 (https://phabricator.wikimedia.org/T107188)
[00:25:24] (CR) CI reject: [V:-1] mediawiki: Disable legacy `short_urls` on vhosts where it does not work [puppet] - https://gerrit.wikimedia.org/r/1290104 (https://phabricator.wikimedia.org/T107188) (owner: Krinkle)
[00:25:57] (PS2) Krinkle: mediawiki: Disable legacy `short_urls` on vhosts where it does not work [puppet] - https://gerrit.wikimedia.org/r/1290104 (https://phabricator.wikimedia.org/T107188)
[00:27:06] (PS3) Krinkle: mediawiki: Disable legacy `short_urls` on vhosts where it does not work [puppet] - https://gerrit.wikimedia.org/r/1290104 (https://phabricator.wikimedia.org/T107188)
[00:28:27] (CR) Jforrester: [C:+1] mediawiki: Disable legacy `short_urls` on vhosts where it does not work [puppet] - https://gerrit.wikimedia.org/r/1290104 (https://phabricator.wikimedia.org/T107188) (owner: Krinkle)
[00:29:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) (owner: Dbrant)
[00:55:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002
[01:00:05] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1289454 (owner: TrainBranchBot)
[01:06:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:09:53] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1290114
[01:09:53] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1290114 (owner: TrainBranchBot)
[01:11:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:44] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1290114 (owner: TrainBranchBot)
[01:22:52] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1027.eqiad.wmnet
[01:25:15] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:26:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:28:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:29:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1027.eqiad.wmnet
[01:30:15] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:01:11] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:04:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:28] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:41] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 29s)
[02:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:43] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002
[02:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:23] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on planet1003.eqiad.wmnet with reason: debug wip
[02:47:05] PROBLEM - Host ml-serve1015 is DOWN: PING CRITICAL - Packet loss = 100%
[02:48:35] RECOVERY - Host ml-serve1015 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[04:05:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:05:55] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:53:29] (PS1) KartikMistry: Update Recommendation API to 2026-05-21-044522-production [deployment-charts] - https://gerrit.wikimedia.org/r/1290266
[05:11:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:21:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1012.eqiad.wmnet with reason: Cloning
[05:22:12] (PS1) Marostegui: pc1012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1290287 (https://phabricator.wikimedia.org/T418973)
[05:22:56] (CR) Marostegui: [C:+2] pc1012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1290287 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[05:24:46] (PS1) Marostegui: instances.yaml: Add pc1022 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1290334 (https://phabricator.wikimedia.org/T418973)
[05:26:47] (CR) Marostegui: [C:+2] instances.yaml: Add pc1022 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1290334 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[05:29:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc1022 to pc2 master T418973', diff saved to https://phabricator.wikimedia.org/P92726 and previous config saved to /var/cache/conftool/dbconfig/20260521-052905-marostegui.json
[05:29:10] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973
[05:30:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc2 T418973', diff saved to https://phabricator.wikimedia.org/P92727 and previous config saved to /var/cache/conftool/dbconfig/20260521-053000-marostegui.json
[05:37:15] (PS1) Marostegui: instances.yaml: Remove pc1012 [puppet] - https://gerrit.wikimedia.org/r/1290478 (https://phabricator.wikimedia.org/T426930)
[05:37:57] (CR) Marostegui: [C:+2] instances.yaml: Remove pc1012 [puppet] - https://gerrit.wikimedia.org/r/1290478 (https://phabricator.wikimedia.org/T426930) (owner: Marostegui)
[05:38:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove pc1012 from dbctl T426930', diff saved to https://phabricator.wikimedia.org/P92728 and previous config saved to /var/cache/conftool/dbconfig/20260521-053858-marostegui.json
[05:39:03] T426930: decommission pc1012.eqiad.wmnet - https://phabricator.wikimedia.org/T426930
[05:43:19] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11943544 (Marostegui) @Jhancock.wm the broken disk is already removed from the disk (sdb) - I can try to make it blink. I am not sure how long the blink lasts so probably better to make it blink...
[05:50:55] FIRING: [3x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:07] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T0600)
[06:00:07] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T0600).
[06:12:37] (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/1290499 (https://phabricator.wikimedia.org/T426633)
[06:13:52] (CR) Marostegui: "@cwilliams@wikimedia.org FYI" [dns] - https://gerrit.wikimedia.org/r/1290499 (https://phabricator.wikimedia.org/T426633) (owner: Marostegui)
[06:13:55] (CR) Marostegui: [C:+2] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/1290499 (https://phabricator.wikimedia.org/T426633) (owner: Marostegui)
[06:13:59] !log marostegui@dns1004 START - running authdns-update
[06:14:14] !log Failover m2-master T426633
[06:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:41] !log marostegui@dns1004 END - running authdns-update
[06:17:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:21:00] (CR) Brouberol: [C:+1] "Thanks!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1290054 (owner: Scott French)
[06:21:18] (CR) Brouberol: [C:+1] Bitu: Adapt approvers for growthbook-readonly and growthbook-elevatedacccess [puppet] - https://gerrit.wikimedia.org/r/1289999 (owner: Muehlenhoff)
[06:21:45] (CR) Muehlenhoff: [C:+2] Bitu: Adapt approvers for growthbook-readonly and growthbook-elevatedacccess [puppet] - https://gerrit.wikimedia.org/r/1289999 (owner: Muehlenhoff)
[06:22:10] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.reboot-gerrit Rebooting Gerrit on gerrit2003
[06:23:07] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943571 (MoritzMuehlenhoff)
[06:23:24] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.reboot-gerrit (exit_code=99) Rebooting Gerrit on gerrit2003
[06:24:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to drbd
[06:25:20] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943574 (ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to drbd
[06:27:35] FIRING: [4x] ProbeDown: Service gerrit2003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:28:52] PROBLEM - Host gerrit2003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:29:13] FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:30:28] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:30:32] RECOVERY - Host gerrit2003 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms
[06:32:31] RESOLVED: [4x] ProbeDown: Service gerrit2003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:33:20] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet
[06:33:28] PROBLEM - Host contint1002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:34:04] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host lists1004.wikimedia.org
[06:34:13] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:45] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[06:34:58] RECOVERY - Host contint1002 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[06:35:10] FIRING: ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:36:14] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[06:38:24] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[06:39:59] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet
[06:40:10] RESOLVED: [4x] ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:40:44] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[06:42:17] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists1004.wikimedia.org
[06:45:18] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:47:50] (CR) Muehlenhoff: "The patch is fine as-is, but if we rename it, let's rather properly also rename is to profile::mariadb::firewall? This should have been a " [puppet] - https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[06:49:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to drbd
[06:49:24] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms
[06:52:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to plain
[06:52:56] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943593 (ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to plain
[06:53:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to plain
[06:54:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[06:55:09] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943597 (ops-monitoring-bot) VM dse-k8s-etcd1001.eqiad.wmnet switching disk type to drbd
[07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:04:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[07:17:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1025.eqiad.wmnet with reason: Rebooting
[07:18:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[07:18:28] (CR) Andriy.v: [C:+1] Disable wgUseFilePatrol in ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1290088 (https://phabricator.wikimedia.org/T426905) (owner: Neriah)
[07:18:51] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943638 (ops-monitoring-bot) VM dse-k8s-etcd1001.eqiad.wmnet switching disk type to plain
[07:20:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[07:21:34] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[07:21:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1256: Upgrading db1256.eqiad.wmnet
[07:22:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1256: Upgrading db1256.eqiad.wmnet
[07:24:02] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1256.eqiad.wmnet with OS trixie
[07:24:22] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp6002.drmrs.wmnet} and A:cp
[07:27:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1005.eqiad.wmnet to drbd
[07:28:24] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943665 (ops-monitoring-bot) VM kubestagemaster1005.eqiad.wmnet switching disk type to drbd
[07:30:20] (CR) Fabfur: [C:+1] Remove cp2041/cp2042 [puppet] - https://gerrit.wikimedia.org/r/1290006 (https://phabricator.wikimedia.org/T426828) (owner: BCornwall)
[07:35:17] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6002.drmrs.wmnet
[07:35:17] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp6002.drmrs.wmnet} and A:cp
[07:35:48] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp6010.drmrs.wmnet} and A:cp
[07:38:42] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1256.eqiad.wmnet with reason: host reimage
[07:42:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1005.eqiad.wmnet to drbd
[07:42:25] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:43:17] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[07:43:58] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1256.eqiad.wmnet with reason: host reimage
[07:44:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1005.eqiad.wmnet to plain
[07:44:56] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943691 (ops-monitoring-bot) VM kubestagemaster1005.eqiad.wmnet switching disk type to plain
[07:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1005.eqiad.wmnet to plain
[07:45:28] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:46:14] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6010.drmrs.wmnet
[07:46:14] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp6010.drmrs.wmnet} and A:cp
[07:47:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet
[07:47:56] (PS1) Marostegui: wmnet: Failover m3-master [dns] - https://gerrit.wikimedia.org/r/1290555 (https://phabricator.wikimedia.org/T426633)
[07:48:11] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11943693 (ops-monitoring-bot) Draining ganeti1023.eqiad.wmnet of running VMs
[07:48:21] !log Failover m3-master T426633
[07:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:28] (CR) Marostegui: [C:+2] wmnet: Failover m3-master [dns] - https://gerrit.wikimedia.org/r/1290555 (https://phabricator.wikimedia.org/T426633) (owner: Marostegui)
[07:50:17] !log marostegui@dns1004 START - running authdns-update
[07:50:28] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:51:54] !log marostegui@dns1004 END - running authdns-update
[07:52:34] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp600[3-4].drmrs.wmnet} and A:cp
[08:00:05] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T0800)
[08:00:16] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1256.eqiad.wmnet with OS trixie
[08:01:31] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6003.drmrs.wmnet
[08:02:29] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1256: Migration of db1256.eqiad.wmnet completed
[08:09:45] the train blocker task got resolved | T426832
[08:09:46] T426832: userrights-interwiki fails with server error - https://phabricator.wikimedia.org/T426832
[08:09:51] I am going to look at the backend logs
[08:10:20] (PS1) Slyngshede: R:cache::upload enable TCP Fast Open [puppet] - https://gerrit.wikimedia.org/r/1290678 (https://phabricator.wikimedia.org/T415454)
[08:11:11] (PS1) Gerrit maintenance bot: mariadb: Promote db2162 to x3 master [puppet] - https://gerrit.wikimedia.org/r/1290679 (https://phabricator.wikimedia.org/T426936)
[08:12:06] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8569/console" [puppet] - https://gerrit.wikimedia.org/r/1290678 (https://phabricator.wikimedia.org/T415454) (owner: Slyngshede)
[08:14:43] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8570/co" [puppet] - https://gerrit.wikimedia.org/r/1290678 (https://phabricator.wikimedia.org/T415454) (owner: Slyngshede)
[08:16:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance
[08:16:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1036 (T426633)', diff saved to https://phabricator.wikimedia.org/P92731 and previous config saved to /var/cache/conftool/dbconfig/20260521-081642-fceratto.json
[08:20:40] lets roll
[08:21:05] choo choo!
[08:21:30] (PS1) TrainBranchBot: group2 to 1.47.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1290680 (https://phabricator.wikimedia.org/T423912)
[08:21:33] (CR) TrainBranchBot: [C:+2] "Initiated by hashar@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290680 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot)
[08:22:37] https://spiderpig.wikimedia.org/jobs/2058 [private]
[08:22:57] (Merged) jenkins-bot: group2 to 1.47.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1290680 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot)
[08:25:40] (PS3) Dpogorzelski: ml-serve(grpc): step 3, add service to k8s pools [puppet] - https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049)
[08:25:47] (CR) Mszwarc: Update UserInfoCard to be enabled by default for certain user groups (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1289895 (https://phabricator.wikimedia.org/T426021) (owner: Mszwarc)
[08:27:01] (PS1) Effie Mouzeli: site.pp: switch forthcoming servers to nftables [puppet] - https://gerrit.wikimedia.org/r/1290681 (https://phabricator.wikimedia.org/T423719)
[08:28:27] (PS3) Dpogorzelski: ml-serve(grpc): step 2, add entry to service catalog [puppet] - https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049)
[08:28:40] (CR) Dpogorzelski: ml-serve(grpc): step 2, add entry to service catalog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: Dpogorzelski)
[08:29:03] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.3 refs T423912
[08:29:07] T423912: 1.47.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T423912
[08:29:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1036 (T426633)', diff saved to https://phabricator.wikimedia.org/P92733 and previous config saved to /var/cache/conftool/dbconfig/20260521-082951-fceratto.json
[08:30:03] jouncebot: next
[08:30:03] In 1 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1000)
[08:30:07] jouncebot: now
[08:30:07] For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T0800)
[08:37:35] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1036: Repooling
[08:40:24] effie: I promoted the remaining wikis to 1.47.0-wmf.3 and currently looking at logs/graph etc ;)
[08:40:42] did you want to deploy anything?
[08:40:48] (CR) JMeybohm: [C:-2] "We can't use nftables for k8s workers as of now" [puppet] - https://gerrit.wikimedia.org/r/1290681 (https://phabricator.wikimedia.org/T423719) (owner: Effie Mouzeli)
[08:42:20] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6004.drmrs.wmnet
[08:42:20] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp600[3-4].drmrs.wmnet} and A:cp
[08:44:03] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11943913 (AnnieKim_WMDE) In discussing what I need in order to do my work, @AndrewTavis_WMDE indicated tha...
[08:44:34] (CR) Elukey: ml-serve: update kserve/knative on prod codfw (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1289935 (https://phabricator.wikimedia.org/T426823) (owner: Dpogorzelski)
[08:44:40] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp601[1-2].drmrs.wmnet} and A:cp
[08:46:03] (CR) Effie Mouzeli: "And that is why moritz cries himself to sleep every night." [puppet] - https://gerrit.wikimedia.org/r/1290681 (https://phabricator.wikimedia.org/T423719) (owner: Effie Mouzeli)
[08:46:13] (Abandoned) Effie Mouzeli: site.pp: switch forthcoming servers to nftables [puppet] - https://gerrit.wikimedia.org/r/1290681 (https://phabricator.wikimedia.org/T423719) (owner: Effie Mouzeli)
[08:46:50] (CR) FNegri: [C:+1] "LGTM, no concerns from my side. I can do a couple of quick connectivity checks after this lands in clouddbs." [puppet] - https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: Ladsgroup)
[08:47:06] (PS1) Muehlenhoff: Remove ganeti1023 from Ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/1290683 (https://phabricator.wikimedia.org/T424680)
[08:47:15] hashar: thank you, I can wait longer !
[08:47:15] (CR) Elukey: [C:+1] "totally trust you :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1288881 (https://phabricator.wikimedia.org/T380626) (owner: Btullis)
[08:47:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1256: Migration of db1256.eqiad.wmnet completed
[08:47:58] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[08:49:43] I am double checking because yesterday I missed some errors that are somehow hidden
[08:49:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet
[08:50:05] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11943959 (SLyngshede-WMF) @Dzahn if you verified the ssh key out of band, then we can just restore your pa...
[08:52:45] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11943962 (SLyngshede-WMF)
[08:53:15] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11943963 (SLyngshede-WMF) @Ottomata already approved.
[08:55:14] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6011.drmrs.wmnet [08:58:29] (03CR) 10Filippo Giunchedi: [C:03+2] alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [09:01:12] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host kafka-jumbo1016.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:02:10] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1375.eqiad.wmnet [09:04:13] btullis@cumin1003 provision (PID 2411651) is awaiting input [09:04:20] (03PS1) 10JavierMonton: stream: webrequest.page_view [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) [09:06:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1037.eqiad.wmnet with reason: Maintenance [09:06:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1037 (T426633)', diff saved to https://phabricator.wikimedia.org/P92738 and previous config saved to /var/cache/conftool/dbconfig/20260521-090609-fceratto.json [09:06:28] (03PS3) 10Elukey: redfish: add add_account method for RedfishDell [software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) [09:07:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1375.eqiad.wmnet [09:07:32] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1376.eqiad.wmnet [09:07:59] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1036: Repooling [09:11:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1376.eqiad.wmnet [09:12:54] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1377.eqiad.wmnet [09:15:55] FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:13] (03PS4) 10Elukey: redfish: add add_account method for RedfishDell [software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) [09:16:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1377.eqiad.wmnet [09:16:47] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1378.eqiad.wmnet [09:17:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037 (T426633)', diff saved to https://phabricator.wikimedia.org/P92740 and previous config saved to /var/cache/conftool/dbconfig/20260521-091738-fceratto.json [09:18:52] !log remove ganeti1023 from eqiad Ganeti cluster T424680 [09:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:56] T424680: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680 [09:18:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1016.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:19:34] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti1023 from Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1290683 (https://phabricator.wikimedia.org/T424680) (owner: 
10Muehlenhoff) [09:20:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS trixie [09:21:32] PROBLEM - ganeti-noded running on ganeti1023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:21:32] PROBLEM - ganeti-confd running on ganeti1023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:21:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1056.eqiad.wmnet to cluster eqiad and group A [09:21:51] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-master-codfw [09:21:55] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2001.codfw.wmnet [09:21:57] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2001.codfw.wmnet [09:22:14] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1378.eqiad.wmnet [09:22:18] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1379.eqiad.wmnet [09:22:42] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-master-eqiad [09:22:46] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1002.eqiad.wmnet [09:22:47] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1002.eqiad.wmnet [09:22:50] FIRING: ProbeDown: Service ganeti1023:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:23:32] !log jayme@cumin1003 START - Cookbook 
sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet [09:23:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1056.eqiad.wmnet to cluster eqiad and group A [09:24:08] !log jayme@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts.*,name=codfw [09:25:04] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet [09:26:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [09:27:19] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet [09:27:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1379.eqiad.wmnet [09:27:47] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1380.eqiad.wmnet [09:27:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037', diff saved to https://phabricator.wikimedia.org/P92741 and previous config saved to /var/cache/conftool/dbconfig/20260521-092746-fceratto.json [09:28:09] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet [09:29:05] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet [09:29:25] !log jayme@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=helm-charts.*,name=codfw [09:29:31] !log jayme@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts.*,name=eqiad [09:29:48] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet [09:29:57] effie: train looks fine as far as I can tell. The cluster is all yours! 
:] [09:30:02] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1002.eqiad.wmnet [09:30:03] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl1002.eqiad.wmnet [09:30:09] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1003.eqiad.wmnet [09:30:10] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1003.eqiad.wmnet [09:31:36] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2001.codfw.wmnet [09:31:37] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2001.codfw.wmnet [09:31:44] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2002.codfw.wmnet [09:31:46] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2002.codfw.wmnet [09:31:59] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet [09:32:12] (03CR) 10Clément Goubert: "I think what's happening is we're just caching stuff we didn't use to, or caching them differently. 
I'll bump investigating this with Traf" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [09:32:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [09:33:08] hashar: chers [09:33:14] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1380.eqiad.wmnet [09:33:18] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1381.eqiad.wmnet [09:33:47] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet [09:33:52] (03CR) 10Effie Mouzeli: [C:03+2] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:34:51] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet [09:35:49] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6012.drmrs.wmnet [09:35:49] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp601[1-2].drmrs.wmnet} and A:cp [09:36:13] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1016.eqiad.wmnet with OS trixie [09:36:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet [09:37:27] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1003.eqiad.wmnet [09:37:28] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl1003.eqiad.wmnet [09:37:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11944058 (10MoritzMuehlenhoff) [09:37:35] !log jayme@cumin1003 START 
- Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1004.eqiad.wmnet [09:37:36] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1004.eqiad.wmnet [09:37:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS trixie [09:37:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037', diff saved to https://phabricator.wikimedia.org/P92742 and previous config saved to /var/cache/conftool/dbconfig/20260521-093754-fceratto.json [09:38:03] (03PS2) 10Effie Mouzeli: changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) [09:38:03] (03PS5) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289894 (https://phabricator.wikimedia.org/T426425) [09:38:21] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2002.codfw.wmnet [09:38:22] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2002.codfw.wmnet [09:38:28] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2003.codfw.wmnet [09:38:30] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2003.codfw.wmnet [09:38:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [09:38:36] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1381.eqiad.wmnet [09:38:40] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1382.eqiad.wmnet [09:38:49] (03PS1) 10AikoChou: ml-services: update article-country, revertrisk, outlink, revertrisk-multilingual image tags [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) [09:39:20] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet [09:40:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [09:41:18] (03CR) 10Elukey: [C:03+1] "I like it! Fixed the CI error and improved the readability of one if condition (at least for me), lemme know what you think about it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [09:42:06] (03CR) 10Effie Mouzeli: [C:03+1] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:42:16] (03PS3) 10Effie Mouzeli: changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) [09:42:34] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new: Another blob upload invalid error when pushing to docker-registry - https://phabricator.wikimedia.org/T422424#11944079 (10JMeybohm) →14Duplicate dup:03T390251 [09:42:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet [09:44:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1382.eqiad.wmnet [09:44:05] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1014.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:44:08] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1383.eqiad.wmnet [09:45:01] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1004.eqiad.wmnet [09:45:02] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) 
pool for host wikikube-ctrl1004.eqiad.wmnet [09:45:02] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-master-eqiad [09:45:37] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2003.codfw.wmnet [09:45:39] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2003.codfw.wmnet [09:45:44] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2004.codfw.wmnet [09:45:46] (03CR) 10AikoChou: [C:03+2] ml-services: update article-country, revertrisk, outlink, revertrisk-multilingual image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) (owner: 10AikoChou) [09:45:46] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2004.codfw.wmnet [09:46:03] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [09:46:08] (03CR) 10Ilias Sarantopoulos: [C:04-1] ml-services: update article-country, revertrisk, outlink, revertrisk-multilingual image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) (owner: 10AikoChou) [09:47:12] (03CR) 10Ilias Sarantopoulos: [C:04-1] "I think you accidentally set the min and max replicas of RRML to 1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) (owner: 10AikoChou) [09:47:18] (03CR) 10Ilias Sarantopoulos: [V:04-1 C:04-1] ml-services: update article-country, revertrisk, outlink, revertrisk-multilingual image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) (owner: 10AikoChou) [09:47:30] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Rebooting clouddb1013 T426563 
[09:47:38] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet [09:48:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1037 (T426633)', diff saved to https://phabricator.wikimedia.org/P92743 and previous config saved to /var/cache/conftool/dbconfig/20260521-094801-fceratto.json [09:48:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1047.eqiad.wmnet with reason: Maintenance [09:48:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet [09:48:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1047 (T426633)', diff saved to https://phabricator.wikimedia.org/P92744 and previous config saved to /var/cache/conftool/dbconfig/20260521-094829-fceratto.json [09:48:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [09:48:39] (03CR) 10AikoChou: "That's in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) (owner: 10AikoChou) [09:48:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [09:48:57] (03CR) 10Effie Mouzeli: [C:03+2] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:49:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1383.eqiad.wmnet [09:49:23] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1014.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:49:25] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1384.eqiad.wmnet [09:49:40] (03CR) 10JavierMonton: [C:03+2] stream: 
webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289894 (https://phabricator.wikimedia.org/T426425) (owner: 10JavierMonton) [09:50:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet [09:50:50] FIRING: KubernetesCalicoDown: ml-serve1014.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1014.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:50:59] (03Merged) 10jenkins-bot: changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:51:28] (03CR) 10Ilias Sarantopoulos: [V:03+2 C:03+2] "nevermind sorry, I overlooked that was in staging, everything looks great!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) (owner: 10AikoChou) [09:51:46] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [09:51:54] (03Merged) 10jenkins-bot: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289894 (https://phabricator.wikimedia.org/T426425) (owner: 10JavierMonton) [09:52:11] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2004.codfw.wmnet [09:52:12] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2004.codfw.wmnet [09:52:16] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [09:52:18] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2005.codfw.wmnet [09:52:20] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2005.codfw.wmnet [09:52:35] !log 
javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:52:50] RESOLVED: [2x] ProbeDown: Service ganeti1023:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:53:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:53:36] (03Merged) 10jenkins-bot: ml-services: update article-country, revertrisk, outlink, revertrisk-multilingual image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290699 (https://phabricator.wikimedia.org/T426807) (owner: 10AikoChou) [09:54:18] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:54:23] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:54:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [09:54:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1384.eqiad.wmnet [09:55:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047 (T426633)', diff saved to https://phabricator.wikimedia.org/P92745 and previous config saved to /var/cache/conftool/dbconfig/20260521-095536-fceratto.json [09:55:48] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:55:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:56:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:56:12] !log 
aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:56:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet [09:56:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [09:56:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet [09:57:00] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1014:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [09:58:05] PROBLEM - Host ml-serve1013 is DOWN: PING CRITICAL - Packet loss = 100% [09:58:35] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [09:58:54] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2005.codfw.wmnet [09:58:56] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2005.codfw.wmnet [09:58:56] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-master-codfw [09:59:39] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1005.eqiad.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1000) [10:00:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [10:00:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:00:50] RESOLVED: KubernetesCalicoDown: ml-serve1014.eqiad.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1014.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:00:59] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:01:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet [10:01:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:02:00] RESOLVED: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1014:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:02:20] FIRING: [2x] KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:02:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [10:03:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet [10:03:46] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-ctrl100[56] implementation tracking - https://phabricator.wikimedia.org/T418920#11944124 (10JMeybohm) [10:04:09] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1005.eqiad.wmnet [10:04:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet [10:04:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet [10:05:46] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet [10:05:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047', diff saved to https://phabricator.wikimedia.org/P92746 and previous config saved to /var/cache/conftool/dbconfig/20260521-100545-fceratto.json [10:05:49] (03PS1) 10Btullis: Add a custom partman reuse recipe for kafka-jumbo101[67] [puppet] - 10https://gerrit.wikimedia.org/r/1290700 (https://phabricator.wikimedia.org/T426585) [10:06:54] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1290700 (https://phabricator.wikimedia.org/T426585) (owner: 10Btullis) [10:07:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet [10:07:53] (03CR) 10CI reject: [V:04-1] Add a custom partman reuse recipe for kafka-jumbo101[67] [puppet] - 10https://gerrit.wikimedia.org/r/1290700 (https://phabricator.wikimedia.org/T426585) (owner: 10Btullis) [10:07:56] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet [10:08:03] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1013.eqiad.wmnet [10:08:03] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1013.eqiad.wmnet [10:08:30] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1013:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:09:08] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:09:19] (03PS2) 10Btullis: Add a custom partman reuse recipe for kafka-jumbo101[67] [puppet] - 10https://gerrit.wikimedia.org/r/1290700 (https://phabricator.wikimedia.org/T426585) [10:09:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org [10:09:41] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:09:55] PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [10:10:21] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet [10:10:31] RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [10:10:35] RECOVERY - Host ml-serve1013 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:10:35] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [10:10:36] (03CR) 10Ilias Sarantopoulos: "Thanks both for making this change. 
I added a couple of nitpicks and a request to also add support for v1/completions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:11:31] (03CR) 10Btullis: [C:03+2] Add a custom partman reuse recipe for kafka-jumbo101[67] [puppet] - 10https://gerrit.wikimedia.org/r/1290700 (https://phabricator.wikimedia.org/T426585) (owner: 10Btullis) [10:12:20] RESOLVED: [2x] KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:12:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet [10:12:57] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp600[5-6].drmrs.wmnet} and A:cp [10:12:59] !log installing postgresql security updates [10:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet [10:13:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet [10:14:11] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[10:14:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:08] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1016.eqiad.wmnet with OS trixie [10:15:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047', diff saved to https://phabricator.wikimedia.org/P92747 and previous config saved to /var/cache/conftool/dbconfig/20260521-101552-fceratto.json [10:17:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet [10:17:11] FIRING: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [10:17:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2011.codfw.wmnet [10:17:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS trixie [10:18:30] RESOLVED: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve1013:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:18:47] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:20:47] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [10:22:11] RESOLVED: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [10:22:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet [10:22:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet [10:24:01] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6005.drmrs.wmnet [10:24:01] (03PS1) 10Clément Goubert: gateway-check: Temp route linkrecommendation to api-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1290708 (https://phabricator.wikimedia.org/T426323) [10:24:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2011.codfw.wmnet [10:26:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047 (T426633)', diff saved to https://phabricator.wikimedia.org/P92748 and previous config saved to /var/cache/conftool/dbconfig/20260521-102601-fceratto.json [10:26:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet [10:26:19] (03PS1) 10Effie Mouzeli: ProductionServices.php: switch filebackend.php to rdb2011:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290709 (https://phabricator.wikimedia.org/T418261) [10:26:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance [10:26:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2036 (T426633)', diff saved to https://phabricator.wikimedia.org/P92749 and previous config saved to /var/cache/conftool/dbconfig/20260521-102630-fceratto.json [10:27:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:27:17] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [10:27:51] !log T423993: reindexing all archive indices [10:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:55] T423993: Upgrade old indices in the CirrusSearch opensearch clusters - https://phabricator.wikimedia.org/T423993 [10:29:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet [10:29:54] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[10:31:14] (03CR) 10Clément Goubert: rest-gateway: Configure qwen3-14b in rest-gateway (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:32:53] (03CR) 10Clément Goubert: [C:03+1] ProductionServices.php: switch filebackend.php to rdb2011:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290709 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [10:33:12] jouncebot: now [10:33:12] For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1000) [10:33:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jiji@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290709 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [10:34:27] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[10:35:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet [10:35:16] (03Merged) 10jenkins-bot: ProductionServices.php: switch filebackend.php to rdb2011:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290709 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [10:35:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet [10:35:36] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2005.codfw.wmnet [10:35:55] FIRING: [3x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:06] !log jiji@deploy1003 Started scap sync-world: Backport for [[gerrit:1290709|ProductionServices.php: switch filebackend.php to rdb2011:6381 (T418261 T419976)]] [10:36:11] T418261: rdb20[11-12] implementation tracking - https://phabricator.wikimedia.org/T418261 [10:36:12] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [10:37:52] !log jiji@deploy1003 jiji: Backport for [[gerrit:1290709|ProductionServices.php: switch filebackend.php to rdb2011:6381 (T418261 T419976)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:38:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2036 (T426633)', diff saved to https://phabricator.wikimedia.org/P92751 and previous config saved to /var/cache/conftool/dbconfig/20260521-103759-fceratto.json [10:39:57] !log jiji@deploy1003 jiji: Continuing with deployment [10:40:09] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1016.eqiad.wmnet with OS trixie [10:40:11] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2005.codfw.wmnet [10:41:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS trixie [10:42:48] (03CR) 10Btullis: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [10:43:35] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[10:43:39] (03PS16) 10Btullis: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [10:43:49] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [10:44:08] !log jiji@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290709|ProductionServices.php: switch filebackend.php to rdb2011:6381 (T418261 T419976)]] (duration: 08m 02s) [10:44:13] T418261: rdb20[11-12] implementation tracking - https://phabricator.wikimedia.org/T418261 [10:44:14] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [10:46:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [10:46:36] (03PS1) 10Elukey: profile::services_proxy::envoy: lower Wikifunctions' timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1290712 [10:47:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2012.codfw.wmnet [10:48:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2036', diff saved to https://phabricator.wikimedia.org/P92752 and previous config saved to /var/cache/conftool/dbconfig/20260521-104807-fceratto.json [10:50:04] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:50:09] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:50:32] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:50:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:50:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for 
host ganeti1044.eqiad.wmnet [10:51:09] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:51:17] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:51:37] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:51:49] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-worker23[57-74] implementation tracking - https://phabricator.wikimedia.org/T418927#11944246 (10Blake) 05Open→03In progress a:03Blake [10:52:10] (03CR) 10Btullis: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [10:53:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2012.codfw.wmnet [10:54:02] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [10:55:11] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage [10:56:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet [10:56:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [10:57:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [10:58:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2036', diff saved to https://phabricator.wikimedia.org/P92753 and previous config saved to /var/cache/conftool/dbconfig/20260521-105815-fceratto.json [10:58:30] (03PS1) 10Btullis: Update partman reuse recipe for kafka-jumbo101[67] [puppet] - 10https://gerrit.wikimedia.org/r/1290714 
(https://phabricator.wikimedia.org/T426585) [10:58:47] (03CR) 10Neriah: Enable 'flood' user group at en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [10:59:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage [10:59:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290088 (https://phabricator.wikimedia.org/T426905) (owner: 10Neriah) [11:00:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [11:01:29] (03PS1) 10JavierMonton: flink: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290715 (https://phabricator.wikimedia.org/T426425) [11:02:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [11:04:49] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6006.drmrs.wmnet [11:04:49] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp600[5-6].drmrs.wmnet} and A:cp [11:05:26] (03CR) 10Ottomata: stream: webrequest.page_view (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: 10JavierMonton) [11:05:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2013.codfw.wmnet [11:05:46] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [11:06:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [11:06:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node 
(exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [11:07:13] (03CR) 10Btullis: [C:03+2] Update partman reuse recipe for kafka-jumbo101[67] [puppet] - 10https://gerrit.wikimedia.org/r/1290714 (https://phabricator.wikimedia.org/T426585) (owner: 10Btullis) [11:08:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2036 (T426633)', diff saved to https://phabricator.wikimedia.org/P92756 and previous config saved to /var/cache/conftool/dbconfig/20260521-110822-fceratto.json [11:08:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2037.codfw.wmnet with reason: Maintenance [11:08:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2037 (T426633)', diff saved to https://phabricator.wikimedia.org/P92757 and previous config saved to /var/cache/conftool/dbconfig/20260521-110851-fceratto.json [11:09:38] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp601[3-4].drmrs.wmnet} and A:cp [11:11:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [11:12:11] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:12:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2013.codfw.wmnet [11:13:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1016.eqiad.wmnet with OS trixie [11:14:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [11:15:31] jmm@cumin2002 drain-node (PID 377622) is awaiting input [11:17:20] (03PS17) 10Btullis: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [11:17:34] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [11:18:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [11:20:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037 (T426633)', diff saved to https://phabricator.wikimedia.org/P92758 and previous config saved to /var/cache/conftool/dbconfig/20260521-112021-fceratto.json [11:20:43] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6013.drmrs.wmnet [11:20:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2014.codfw.wmnet [11:21:11] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:22:23] (03PS1) 10Btullis: Remove absented sqoop resources [puppet] - 10https://gerrit.wikimedia.org/r/1290716 (https://phabricator.wikimedia.org/T424355) [11:24:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [11:24:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [11:24:54] (03PS2) 10Dpogorzelski: ml-serve: update kserve/knative on prod codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289935 (https://phabricator.wikimedia.org/T426823) [11:24:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [11:25:38] (03PS7) 10Gkyziridis: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) [11:25:43] (03CR) 10Dpogorzelski: "removed block from staging and prod. 
if that's what you meant :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289935 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [11:26:23] (03PS3) 10Arthur taylor: Disable support for PHP-serialized EntityData on Wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) [11:26:50] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet [11:26:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet [11:27:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet [11:27:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:27:16] (03PS4) 10Arthur taylor: Disable support for PHP-serialized EntityData on Wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) [11:27:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2014.codfw.wmnet [11:27:24] (03CR) 10Arthur taylor: Disable support for PHP-serialized EntityData on Wikidata production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [11:27:28] (03CR) 10JavierMonton: [C:03+2] flink: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290715 (https://phabricator.wikimedia.org/T426425) (owner: 10JavierMonton) [11:27:48] (03CR) 10Btullis: "Done." 
[puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [11:29:43] PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:30:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037', diff saved to https://phabricator.wikimedia.org/P92759 and previous config saved to /var/cache/conftool/dbconfig/20260521-113028-fceratto.json [11:30:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet [11:31:38] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc1001.eqiad.wmnet [11:32:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet [11:33:00] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker-exp2001.codfw.wmnet [11:33:52] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2026-05-21-044522-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290266 (owner: 10KartikMistry) [11:33:57] FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:34:08] (03CR) 10Neriah: [C:04-1] "per T424413#11944351" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281901 (https://phabricator.wikimedia.org/T424413) (owner: 10Codename Noreste) [11:34:46] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet [11:34:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet [11:35:04] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet [11:35:13] 
!log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [11:35:15] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-staging2001.codfw.wmnet [11:35:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [11:35:38] (03Merged) 10jenkins-bot: flink: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290715 (https://phabricator.wikimedia.org/T426425) (owner: 10JavierMonton) [11:36:17] RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [11:36:54] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet [11:36:58] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc1002.eqiad.wmnet [11:37:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet [11:38:40] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet [11:38:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:39:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker-exp2001.codfw.wmnet [11:39:10] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker-exp1001.eqiad.wmnet [11:40:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037', diff saved to https://phabricator.wikimedia.org/P92760 and previous config saved to /var/cache/conftool/dbconfig/20260521-114036-fceratto.json [11:40:48] !log 
jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [11:42:10] (03Merged) 10jenkins-bot: Update Recommendation API to 2026-05-21-044522-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290266 (owner: 10KartikMistry) [11:42:14] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1002.eqiad.wmnet [11:42:39] jmm@cumin2002 drain-node (PID 395198) is awaiting input [11:43:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org [11:44:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1049.eqiad.wmnet [11:44:30] Deploying recommendation-api.. [11:44:37] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:45:13] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker-exp1001.eqiad.wmnet [11:45:21] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-staging2001.codfw.wmnet [11:48:45] (03CR) 10JMeybohm: [C:03+1] service: move Aux k8s' ingress to IPIP load balancing [puppet] - 10https://gerrit.wikimedia.org/r/1289274 (https://phabricator.wikimedia.org/T420439) (owner: 10Elukey) [11:48:50] (03CR) 10JMeybohm: [C:03+1] services: move the aux k8s' kubemaster to IPIP load balancing [puppet] - 10https://gerrit.wikimedia.org/r/1289273 (https://phabricator.wikimedia.org/T420439) (owner: 10Elukey) [11:49:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1049.eqiad.wmnet [11:49:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org [11:49:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet [11:50:38] !log klausman@cumin1003 START - Cookbook 
sre.k8s.pool-depool-node pool for host ml-staging2001.codfw.wmnet [11:50:40] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-staging2001.codfw.wmnet [11:50:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2037 (T426633)', diff saved to https://phabricator.wikimedia.org/P92761 and previous config saved to /var/cache/conftool/dbconfig/20260521-115043-fceratto.json [11:50:44] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-staging2002.codfw.wmnet [11:50:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet [11:51:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2047.codfw.wmnet with reason: Maintenance [11:51:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2047 (T426633)', diff saved to https://phabricator.wikimedia.org/P92762 and previous config saved to /var/cache/conftool/dbconfig/20260521-115112-fceratto.json [11:51:28] !log disabling puppet on C:bird to roll out 1289919 [11:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1002.wikimedia.org [11:53:07] (03CR) 10Majavah: [V:03+1 C:03+2] bird: Create anycast-healthchecker run directory with tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1289919 (https://phabricator.wikimedia.org/T426837) (owner: 10Majavah) [11:53:25] (03PS8) 10Gkyziridis: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) [11:53:46] (03CR) 10JMeybohm: [C:04-1] docker_registry: allow multiple docker instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [11:56:21] jmm@cumin2002 drain-node (PID 
403007) is awaiting input [11:56:23] (03PS4) 10Brouberol: Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) [11:57:01] (03CR) 10CI reject: [V:04-1] Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [11:57:31] (03PS5) 10Brouberol: Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) [11:57:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet [11:58:16] (03CR) 10Brouberol: [C:03+1] Remove absented sqoop resources [puppet] - 10https://gerrit.wikimedia.org/r/1290716 (https://phabricator.wikimedia.org/T424355) (owner: 10Btullis) [11:58:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047 (T426633)', diff saved to https://phabricator.wikimedia.org/P92763 and previous config saved to /var/cache/conftool/dbconfig/20260521-115817-fceratto.json [11:58:19] PROBLEM - Host cloudlb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:58:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [11:58:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1002.wikimedia.org [11:58:55] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:59:16] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [11:59:51] RECOVERY - Host cloudlb2002-dev is UP: PING OK - Packet 
loss = 0%, RTA = 30.33 ms [11:59:55] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1200) [12:00:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:00:51] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-staging2002.codfw.wmnet [12:00:53] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1017.eqiad.wmnet with OS trixie [12:01:10] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:01:25] (03PS1) 10Kosta Harlan: hCaptcha: Finish group1 account creation rollout + itwiki/hewiki for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290727 (https://phabricator.wikimedia.org/T426045) [12:01:36] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6014.drmrs.wmnet [12:01:36] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp601[3-4].drmrs.wmnet} and A:cp [12:02:13] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp600[7-8].drmrs.wmnet} and A:cp [12:02:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet [12:03:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node 
(exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet [12:05:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:06:10] RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:07:35] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-staging2002.codfw.wmnet [12:07:37] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-staging2002.codfw.wmnet [12:07:41] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-staging2003.codfw.wmnet [12:08:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047', diff saved to https://phabricator.wikimedia.org/P92764 and previous config saved to /var/cache/conftool/dbconfig/20260521-120824-fceratto.json [12:09:38] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: update kserve/knative on prod codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289935 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [12:10:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet [12:11:16] PROBLEM - MariaDB Replica Lag: backup1-codfw on db2184 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [12:11:17] jouncebot: nowandnext [12:11:17] For the next 0 hour(s) and 48 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1200) [12:11:17] In 0 hour(s) and 48 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1300) [12:12:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2003.codfw.wmnet [12:13:00] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6007.drmrs.wmnet [12:14:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet [12:14:30] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1290734 (owner: 10L10n-bot) [12:14:54] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Okay to deploy next Thursday, 2026-05-28." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [12:15:03] (03CR) 10Michael Große: "recheck" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [12:15:20] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage [12:15:31] (03PS2) 10KartikMistry: Update cxserver to 2026-05-20-034002-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289463 (https://phabricator.wikimedia.org/T388690) [12:16:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2003.codfw.wmnet [12:16:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [12:17:38] (03Merged) 10jenkins-bot: ml-serve: update kserve/knative on prod codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289935 (https://phabricator.wikimedia.org/T426823) (owner: 
10Dpogorzelski) [12:17:51] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-staging2003.codfw.wmnet [12:18:18] (03CR) 10Kosta Harlan: "Related to I91dc92f0d6d5201c5765eec1b5e4cdd68e252373 ?" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [12:18:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047', diff saved to https://phabricator.wikimedia.org/P92765 and previous config saved to /var/cache/conftool/dbconfig/20260521-121832-fceratto.json [12:18:48] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2026-05-20-034002-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289463 (https://phabricator.wikimedia.org/T388690) (owner: 10KartikMistry) [12:18:58] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/ml-staging-codfw: maintenance [12:18:58] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/ml-staging-codfw: maintenance [12:18:59] (03CR) 10Majavah: [C:03+1] wikireplicas: Migrate from ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [12:19:12] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/ml-staging-codfw: maintenance [12:19:12] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/ml-staging-codfw: maintenance [12:19:29] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/ml-serve-codfw: maintenance [12:19:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet [12:19:43] !log brouberol@cumin1003 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage [12:19:53] (03CR) 10Tiziano Fogli: [C:03+1] prometheus, thanos: move recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [12:20:03] (03CR) 10Tiziano Fogli: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [12:20:11] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/ml-serve-codfw: maintenance [12:20:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet [12:20:50] (03Merged) 10jenkins-bot: Update cxserver to 2026-05-20-034002-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289463 (https://phabricator.wikimedia.org/T388690) (owner: 10KartikMistry) [12:21:26] (03CR) 10Michael Große: "Mh, plausibly. 
I guess that means I have to also cherry-pick If83e2f1268134118cfd69c224ff4ceb5affd0e4f" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [12:21:29] !log installing nginx security updates [12:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:35] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [12:21:47] (03PS1) 10Michael Große: composer.json: Updated symfony/yaml from 7.4.6 to 7.4.12 [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290743 (https://phabricator.wikimedia.org/T426861) [12:21:58] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:22:07] (03PS2) 10Michael Große: Skip init.test.js test if VisualEditor not installed [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) [12:22:10] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [12:22:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet [12:23:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:23:23] (03CR) 10Jforrester: "Oh, oops, I thought we'd cherry-picked in core as well as vendor." 
[core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290743 (https://phabricator.wikimedia.org/T426861) (owner: 10Michael Große) [12:23:49] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-staging2003.codfw.wmnet [12:23:51] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-staging2003.codfw.wmnet [12:23:51] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-staging-worker [12:25:30] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: fasw2-c8a-codfw:xe-0/0/47 low RX power - https://phabricator.wikimedia.org/T426824#11944579 (10Jgreen) >>! In T426824#11942813, @Jhancock.wm wrote: > i can get this one in the morning if Jeff or Dallas is around and want to coordinate. @Jhancock.wm I'... [12:25:55] FIRING: [2x] SystemdUnitFailed: netbox_report_network_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:45] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:26:47] (03PS1) 10Blake: Add wikikube-worker refreshes. [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) [12:27:03] (03PS2) 10Blake: Add wikikube-worker refreshes. [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) [12:27:03] (03CR) 10Jforrester: [C:04-1] "This is totally going in the wrong direction, turning 4xx errors into 5xx errors? 
Infrastructure timeouts (from Envoy or whatever) are the" [puppet] - 10https://gerrit.wikimedia.org/r/1290712 (owner: 10Elukey) [12:27:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:27:25] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:27:59] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:28:39] jmm@cumin2002 drain-node (PID 423692) is awaiting input [12:28:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2047 (T426633)', diff saved to https://phabricator.wikimedia.org/P92766 and previous config saved to /var/cache/conftool/dbconfig/20260521-122839-fceratto.json [12:28:55] (03CR) 10CI reject: [V:04-1] composer.json: Updated symfony/yaml from 7.4.6 to 7.4.12 [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290743 (https://phabricator.wikimedia.org/T426861) (owner: 10Michael Große) [12:28:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1039.eqiad.wmnet with reason: Maintenance [12:29:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1039 (T426633)', diff saved to https://phabricator.wikimedia.org/P92767 and previous config saved to /var/cache/conftool/dbconfig/20260521-122905-fceratto.json [12:29:10] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[12:29:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:30:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1003.eqiad.wmnet [12:30:11] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:30:44] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:32:04] !ack [12:32:05] All incidents are already acked. [12:32:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet [12:33:08] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Finish group1 account creation rollout + itwiki/hewiki for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290727 (https://phabricator.wikimedia.org/T426045) (owner: 10Kosta Harlan) [12:34:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1003.eqiad.wmnet [12:34:14] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[12:34:17] !log Updated cxserver to 2026-05-20-034002-production (T388690, T404295, T391703, T426605) [12:34:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290727 (https://phabricator.wikimedia.org/T426045) (owner: 10Kosta Harlan) [12:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:26] T388690: Update tests to use the OpenAPI 3.0 spec - https://phabricator.wikimedia.org/T388690 [12:34:26] T404295: Cxserver API docs for v2/page/sourcelang/targetlang/title/revision has invalid revision example - https://phabricator.wikimedia.org/T404295 [12:34:26] T391703: Replace jsduck with JSDoc in CX Server - https://phabricator.wikimedia.org/T391703 [12:34:27] T426605: cxserver: Update packages with security vulnerabilities - https://phabricator.wikimedia.org/T426605 [12:34:37] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:44] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1017.eqiad.wmnet with OS trixie [12:34:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:34:59] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1018 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289951 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [12:35:17] (03Merged) 10jenkins-bot: hCaptcha: Finish group1 account creation rollout + itwiki/hewiki for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290727 (https://phabricator.wikimedia.org/T426045) (owner: 10Kosta Harlan) [12:35:26] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[12:35:29] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1290727|hCaptcha: Finish group1 account creation rollout + itwiki/hewiki for mobile apps (T426045 T425354)]] [12:35:34] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:35:35] T426045: Roll out hCaptcha for use on mobile app clients for Group 1 - hewiki & itwiki - https://phabricator.wikimedia.org/T426045 [12:35:35] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [12:35:47] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [12:36:15] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3066.esams.wmnet} and A:cp [12:36:45] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:37:19] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1290727|hCaptcha: Finish group1 account creation rollout + itwiki/hewiki for mobile apps (T426045 T425354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:37:36] fabfur: not sure why this paged, everything seemed to be in order? 
[12:37:39] (03CR) 10Michael Große: "recheck" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290743 (https://phabricator.wikimedia.org/T426861) (owner: 10Michael Große) [12:37:42] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS trixie [12:37:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290743 (https://phabricator.wikimedia.org/T426861) (owner: 10Michael Große) [12:37:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet [12:38:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet [12:38:11] there was a slight uptick in NEL failures, but nothing which would really indicate a bigger issue [12:38:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [12:38:55] (03CR) 10Michael Große: "recheck" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289342 (https://phabricator.wikimedia.org/T419413) (owner: 10Michael Große) [12:39:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289342 (https://phabricator.wikimedia.org/T419413) (owner: 10Michael Große) [12:39:10] !log kharlan@deploy1003 kharlan: Continuing with deployment [12:40:15] !log fceratto@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance es1039 (T426633)', diff saved to https://phabricator.wikimedia.org/P92768 and previous config saved to /var/cache/conftool/dbconfig/20260521-124014-fceratto.json [12:41:01] moritzm: yeah I was checking the same [12:42:41] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:42:58] (03Abandoned) 10Kosta Harlan: rest-gateway: Add Vary: Origin to CORS-enabled routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [12:43:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:43:24] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290727|hCaptcha: Finish group1 account creation rollout + itwiki/hewiki for mobile apps (T426045 T425354)]] (duration: 07m 54s) [12:43:30] T426045: Roll out hCaptcha for use on mobile app clients for Group 1 - hewiki & itwiki - https://phabricator.wikimedia.org/T426045 [12:43:31] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [12:43:52] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:43:54] (03PS1) 10Majavah: P:ssl: Renew Toolforge Prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/1290754 [12:44:17] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:45:29] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[12:45:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [12:46:10] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:46:14] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1039: Repooling [12:46:15] (03CR) 10Majavah: [C:03+2] P:ssl: Renew Toolforge Prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/1290754 (owner: 10Majavah) [12:47:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1040.eqiad.wmnet with reason: Maintenance [12:47:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1040 (T426633)', diff saved to https://phabricator.wikimedia.org/P92770 and previous config saved to /var/cache/conftool/dbconfig/20260521-124707-fceratto.json [12:47:22] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:47:36] (03PS21) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:48:27] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3066.esams.wmnet [12:48:27] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3066.esams.wmnet} and A:cp [12:48:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [12:49:14] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [12:49:18] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1002.eqiad.wmnet [12:49:45] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [12:50:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - 
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:50:59] (03CR) 10Fabfur: [C:03+1] P:cache::haproxy: guard webrequest IP reputation data for beta [puppet] - 10https://gerrit.wikimedia.org/r/1290047 (https://phabricator.wikimedia.org/T426822) (owner: 10Ssingh) [12:51:52] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:52:15] (03CR) 10Ssingh: [V:03+1 C:03+2] P:cache::haproxy: guard webrequest IP reputation data for beta [puppet] - 10https://gerrit.wikimedia.org/r/1290047 (https://phabricator.wikimedia.org/T426822) (owner: 10Ssingh) [12:52:29] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1018.eqiad.wmnet with reason: host reimage [12:53:02] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[12:54:14] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6008.drmrs.wmnet [12:54:14] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp600[7-8].drmrs.wmnet} and A:cp [12:54:22] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1002.eqiad.wmnet [12:54:28] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3074.esams.wmnet} and A:cp [12:54:40] (03CR) 10Ssingh: [C:03+1] R:cache::upload enable TCP Fast Open [puppet] - 10https://gerrit.wikimedia.org/r/1290678 (https://phabricator.wikimedia.org/T415454) (owner: 10Slyngshede) [12:54:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [12:55:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [12:55:34] (03PS22) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:55:55] FIRING: [2x] SystemdUnitFailed: netbox_report_network_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:03] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[12:56:12] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 18 hosts with reason: Primary switchover x3 T426936 [12:56:16] T426936: Switchover x3 master (db2241 -> db2162) - https://phabricator.wikimedia.org/T426936 [12:56:46] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Set db2162 with weight 0 T426936', diff saved to https://phabricator.wikimedia.org/P92771 and previous config saved to /var/cache/conftool/dbconfig/20260521-125645-cwilliams.json [12:57:45] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [12:57:52] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:58:00] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:59:03] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1002.eqiad.wmnet [12:59:04] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1002.eqiad.wmnet [12:59:09] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1018.eqiad.wmnet with reason: host reimage [12:59:10] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1003.eqiad.wmnet [12:59:26] (03CR) 10CWilliams: [C:03+2] mariadb: Promote db2162 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1290679 (https://phabricator.wikimedia.org/T426936) (owner: 10Gerrit maintenance bot) [12:59:34] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2162 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1290679 (https://phabricator.wikimedia.org/T426936) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1300). [13:00:05] stephanebisson, codenamenoreste, dbrant, Neriah, and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:12] I'm here :) [13:00:15] o/ [13:00:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040 (T426633)', diff saved to https://phabricator.wikimedia.org/P92772 and previous config saved to /var/cache/conftool/dbconfig/20260521-130018-fceratto.json [13:00:20] will it be possible to start with my changes? [13:00:23] o/ [13:00:24] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:00:29] Hey :) [13:00:52] (I also need a deployer) [13:00:58] Neriah: I’m fine with prioritizing volunteer changes ^^ [13:01:02] can your two changes be deployed together? [13:01:06] (I still need to look at them first though) [13:01:34] No worries, I'll go right after [13:01:49] Lucas_WMDE I guess they can be together [13:02:20] (03PS4) 10Neriah: Enable 'flood' user group at en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) [13:02:37] (03PS3) 10Neriah: Disable wgUseFilePatrol in ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290088 (https://phabricator.wikimedia.org/T426905) [13:02:55] (03CR) 10CWilliams: [C:03+2] mariadb: Promote db2162 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1290679 (https://phabricator.wikimedia.org/T426936) (owner: 10Gerrit maintenance bot) [13:03:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290088 (https://phabricator.wikimedia.org/T426905) (owner: 10Neriah) [13:03:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [13:03:37] okay! [13:03:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet [13:04:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [13:04:13] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1003.eqiad.wmnet [13:04:40] !log Starting x3 codfw failover from db2241 to db2162 - T426936 [13:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:44] T426936: Switchover x3 master (db2241 -> db2162) - https://phabricator.wikimedia.org/T426936 [13:04:59] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:06:10] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Promote db2162 to x3 primary T426936', diff saved to https://phabricator.wikimedia.org/P92774 and previous config saved to /var/cache/conftool/dbconfig/20260521-130609-cwilliams.json [13:06:11] (03PS23) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [13:06:12] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3074.esams.wmnet [13:06:12] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3074.esams.wmnet} and A:cp [13:06:37] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp601[5-6].drmrs.wmnet} and A:cp [13:07:05] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:07:12] jmm@cumin2002 drain-node (PID 452158) is awaiting input [13:07:45] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet [13:07:56] (03Merged) 10jenkins-bot: Disable wgUseFilePatrol in ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290088 (https://phabricator.wikimedia.org/T426905) (owner: 10Neriah) [13:07:59] (03Merged) 10jenkins-bot: Enable 'flood' user group at en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [13:08:02] that took a while in CI o_O [13:08:13] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1290088|Disable wgUseFilePatrol in ukwiki (T426905)]], [[gerrit:1290032|Enable 'flood' user group at en.wikiversity (T426882)]] [13:08:18] T426905: Disable $wgUseFilePatrol in ukwiki - https://phabricator.wikimedia.org/T426905 [13:08:18] T426882: Add a pseudo-bot user group on English Wikiversity - https://phabricator.wikimedia.org/T426882 [13:08:41] ah, four minutes Waiting for the completion of castor-save-workspace-cache [13:08:46] hello castor my old friend… [13:09:01] I’ve come to wait for you again [13:09:12] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:09:13] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:09:15] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [13:09:20] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:09:25] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[13:09:30] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [13:09:35] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:09:42] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:09:47] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:09:56] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:09:57] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, neriah: Backport for [[gerrit:1290088|Disable wgUseFilePatrol in ukwiki (T426905)]], [[gerrit:1290032|Enable 'flood' user group at en.wikiversity (T426882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:10:02] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:10:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [13:10:06] Neriah: please test using WikimediaDebug :) [13:10:07] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:10:12] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [13:10:16] RECOVERY - MariaDB Replica Lag: backup1-codfw on db2184 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [13:10:17] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . 
[13:10:17] testing [13:10:22] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:10:26] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Depool db2241 T426936', diff saved to https://phabricator.wikimedia.org/P92775 and previous config saved to /var/cache/conftool/dbconfig/20260521-131025-cwilliams.json [13:10:27] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:10:28] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:10:30] T426936: Switchover x3 master (db2241 -> db2162) - https://phabricator.wikimedia.org/T426936 [13:10:32] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:10:32] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1003.eqiad.wmnet [13:10:33] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1003.eqiad.wmnet [13:10:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040', diff saved to https://phabricator.wikimedia.org/P92776 and previous config saved to /var/cache/conftool/dbconfig/20260521-131033-fceratto.json [13:10:38] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:10:39] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1004.eqiad.wmnet [13:10:43] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:11:23] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase [13:11:34] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [13:11:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:18] (03CR) 10Btullis: [C:03+2] changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [13:12:33] (03CR) 10Cwhite: [C:03+1] corto: set default visibility to WMF-NDA [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [13:13:10] (03CR) 10Ssingh: [C:03+1] "You will need two additional patches to change state: service_setup to lvs_setup and then production." 
[puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:14:13] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:14:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [13:15:41] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1004.eqiad.wmnet [13:15:43] (03CR) 10Elukey: [C:03+1] ml-serve: update kserve/knative on prod codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289935 (https://phabricator.wikimedia.org/T426823) (owner: 10Dpogorzelski) [13:15:53] Both changes look good [13:15:56] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, neriah: Continuing with deployment [13:15:58] great, thanks! 
[13:16:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [13:16:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [13:16:40] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1039: Repooling [13:16:47] (03CR) 10Lucas Werkmeister (WMDE): Enable AG on phase 2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290014 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [13:16:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [13:17:38] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1018.eqiad.wmnet with OS trixie [13:17:50] (03PS2) 10Btullis: Remove absented sqoop resources [puppet] - 10https://gerrit.wikimedia.org/r/1290716 (https://phabricator.wikimedia.org/T424355) [13:18:28] (03PS2) 10JavierMonton: stream: webrequest.page_view [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) [13:18:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [13:18:50] FIRING: [7x] ProbeDown: Service ganeti1030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:22] (03PS2) 10Sbisson: Enable AG on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290014 (https://phabricator.wikimedia.org/T426871) [13:19:38] (03CR) 10Sbisson: Enable AG on phase 2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290014 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [13:20:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka 
cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:20:08] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290088|Disable wgUseFilePatrol in ukwiki (T426905)]], [[gerrit:1290032|Enable 'flood' user group at en.wikiversity (T426882)]] (duration: 11m 55s) [13:20:14] T426905: Disable $wgUseFilePatrol in ukwiki - https://phabricator.wikimedia.org/T426905 [13:20:14] T426882: Add a pseudo-bot user group on English Wikiversity - https://phabricator.wikimedia.org/T426882 [13:20:23] stephanebisson: over to you :) [13:20:32] Thanks, on it [13:20:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [13:20:35] thanks :) [13:20:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040', diff saved to https://phabricator.wikimedia.org/P92778 and previous config saved to /var/cache/conftool/dbconfig/20260521-132041-fceratto.json [13:20:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290014 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [13:22:04] (03Merged) 10jenkins-bot: Enable AG on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290014 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [13:22:11] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1004.eqiad.wmnet [13:22:12] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1004.eqiad.wmnet [13:22:18] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1005.eqiad.wmnet [13:22:19] !log sbisson@deploy1003 
Started scap sync-world: Backport for [[gerrit:1290014|Enable AG on phase 2 wikis (T426871)]] [13:22:24] T426871: Enable AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [13:23:09] (03CR) 10JavierMonton: stream: webrequest.page_view (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: 10JavierMonton) [13:23:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:23:28] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2241: Upgrading db2241.codfw.wmnet [13:23:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2241: Upgrading db2241.codfw.wmnet [13:23:50] RESOLVED: [7x] ProbeDown: Service ganeti1030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:58] (03CR) 10Elukey: "The envoy timeouts are related to a single HTTP request towards an evaluator, and IIUC the orchestrator can fan-out multiple requests. On " [puppet] - 10https://gerrit.wikimedia.org/r/1290712 (owner: 10Elukey) [13:24:04] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1290014|Enable AG on phase 2 wikis (T426871)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:24:13] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:14] (03PS1) 10Marostegui: db2218: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1290764 [13:24:23] Testing... [13:24:55] (03CR) 10Cathal Mooney: [C:03+1] "great stuff!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1289919 (https://phabricator.wikimedia.org/T426837) (owner: 10Majavah) [13:25:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:25:08] (03CR) 10Marostegui: [C:03+2] db2218: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1290764 (owner: 10Marostegui) [13:25:10] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2241.codfw.wmnet with OS trixie [13:25:10] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:26:13] (03CR) 10Elukey: "See what I wrote on Slack, but the TL;DR is that I am very sure that the main problem here is the orchestrator lagging behind enforcing th" [puppet] - 10https://gerrit.wikimedia.org/r/1290712 (owner: 10Elukey) [13:26:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [13:26:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [13:27:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:27:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2218: repool after maintenance [13:27:18] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:27:20] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1005.eqiad.wmnet [13:27:46] PROBLEM - Host cp6015 is DOWN: PING CRITICAL - Packet loss = 100% [13:28:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [13:28:50] FIRING: [14x] ProbeDown: Service ganeti1030:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [13:30:20] (03PS4) 10Raymond Ndibe: handle missing kubeconfig error in replica_cnf_backend [puppet] - 10https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) [13:30:32] (03PS1) 10Elukey: role::cache::haproxy: move webrequest ip reputation exp to all magru [puppet] - 10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) [13:30:46] (03CR) 10Btullis: [C:03+2] Remove absented sqoop resources [puppet] - 10https://gerrit.wikimedia.org/r/1290716 (https://phabricator.wikimedia.org/T424355) (owner: 10Btullis) [13:30:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040 (T426633)', diff saved to https://phabricator.wikimedia.org/P92780 and previous config saved to /var/cache/conftool/dbconfig/20260521-133048-fceratto.json [13:30:55] (03PS2) 10Elukey: role::cache::haproxy: move webrequest ip reputation exp to all magru [puppet] - 10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) [13:31:05] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [13:31:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1048.eqiad.wmnet with reason: Maintenance [13:31:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1048 (T426633)', diff saved to https://phabricator.wikimedia.org/P92781 and previous config saved to /var/cache/conftool/dbconfig/20260521-133116-fceratto.json [13:31:29] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290014|Enable AG on phase 2 wikis (T426871)]] (duration: 09m 11s) [13:31:34] T426871: Enable 
AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [13:32:49] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1005.eqiad.wmnet [13:32:50] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1005.eqiad.wmnet [13:32:53] stephanebisson: anything else, or can dbrant take over? [13:32:56] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1006.eqiad.wmnet [13:33:22] Yes, all good here. Go ahead [13:33:28] proceeding [13:33:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dbrant@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [13:33:50] RESOLVED: [13x] ProbeDown: Service ganeti1031:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:13] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:43] (03Merged) 10jenkins-bot: docroot: Remove non-wikipedias from digital asset links. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [13:34:58] !log dbrant@deploy1003 Started scap sync-world: Backport for [[gerrit:1290035|docroot: Remove non-wikipedias from digital asset links. 
(T426010 T385520)]] [13:35:04] T426010: Enable integration with Credential Manager - https://phabricator.wikimedia.org/T426010 [13:35:05] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [13:35:40] MichaelG_WMF: do you need a deployer btw? [13:35:53] Lucas_WMDE: I do! [13:36:00] ok, I can deploy [13:36:02] can they all go together? [13:36:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [13:36:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [13:36:24] Lucas_WMDE: Yes. The important one is the WikimediaEvents change, I can test that one. The other two are dependencies. [13:36:30] (starts mentally playing https://en.wikisource.org/wiki/Songs_and_Lyrics_(Lehrer)/We_Will_All_Go_Together_When_We_Go) [13:36:32] ok! [13:36:45] let’s start the gate-and-submit then [13:36:46] !log dbrant@deploy1003 dbrant: Backport for [[gerrit:1290035|docroot: Remove non-wikipedias from digital asset links. (T426010 T385520)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:36:51] Thank you :blush [13:36:53] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290743 (https://phabricator.wikimedia.org/T426861) (owner: 10Michael Große) [13:36:57] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [13:37:01] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289342 (https://phabricator.wikimedia.org/T419413) (owner: 10Michael Große) [13:37:33] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11944981 (10ssingh) >>! In T414411#11914980, @RobH wrote: > Scheduled a new site visit for them to go out this Friday @ 8AM Singapore Time so my Thursday @ 4PM. > > 1-260037210462 Hi @RobH: Was this... 
[13:37:39] !log dbrant@deploy1003 dbrant: Continuing with deployment [13:37:56] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/ml-serve-codfw: maintenance [13:37:59] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1006.eqiad.wmnet [13:38:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 (T426633)', diff saved to https://phabricator.wikimedia.org/P92782 and previous config saved to /var/cache/conftool/dbconfig/20260521-133815-fceratto.json [13:38:42] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/ml-serve-codfw: maintenance [13:38:50] FIRING: [14x] ProbeDown: Service ganeti1031:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [13:40:05] (03CR) 10CDanis: [C:03+1] role::cache::haproxy: move webrequest ip reputation exp to all magru [puppet] - 10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [13:40:13] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289967 (owner: 10Muehlenhoff) [13:40:28] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2241.codfw.wmnet with reason: host reimage [13:41:27] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [13:41:42] (03Merged) 10jenkins-bot: composer.json: Updated symfony/yaml from 7.4.6 to 7.4.12 [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290743 (https://phabricator.wikimedia.org/T426861) (owner: 10Michael Große) [13:41:46] (03Merged) 10jenkins-bot: Skip 
init.test.js test if VisualEditor not installed [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [13:41:49] (03Merged) 10jenkins-bot: fix: simplify to show only one icon type for password reveal [extensions/WikimediaEvents] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289342 (https://phabricator.wikimedia.org/T419413) (owner: 10Michael Große) [13:41:51] !log dbrant@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290035|docroot: Remove non-wikipedias from digital asset links. (T426010 T385520)]] (duration: 06m 52s) [13:41:57] T426010: Enable integration with Credential Manager - https://phabricator.wikimedia.org/T426010 [13:41:57] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [13:42:11] jmm@cumin2002 drain-node (PID 475255) is awaiting input [13:42:12] MichaelG_WMF: all yours! [13:42:19] thanks! I’ll deploy [13:42:50] sheesh, a lot of depends-on warnings in the output *reads* [13:42:55] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1006.eqiad.wmnet [13:42:56] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1006.eqiad.wmnet [13:43:02] “Change '1290028' has 6 Depends-On relationship(s) (1290037, 1290026, 1290021, 1290015, 1290005, 1289973) but none were deemed relevant by the dependency analysis rules. 
This may be unexpected.” [13:43:02] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1007.eqiad.wmnet [13:43:15] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [13:43:45] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1290743|composer.json: Updated symfony/yaml from 7.4.6 to 7.4.12 (T426861)]], [[gerrit:1289347|Skip init.test.js test if VisualEditor not installed (T426740)]], [[gerrit:1289342|fix: simplify to show only one icon type for password reveal (T419413)]] [13:43:50] RESOLVED: [13x] ProbeDown: Service ganeti1032:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:53] T426861: symfony/yaml security issues blocking vendor - https://phabricator.wikimedia.org/T426861 [13:43:53] T426740: QUnit test "ext.confirmEdit.hCaptcha.secureEnclave" fails on unrelated WikimediaEvents change - https://phabricator.wikimedia.org/T426740 [13:43:53] T419413: Add reveal password action to mobile account creation form - https://phabricator.wikimedia.org/T419413 [13:44:21] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [13:44:29] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2241.codfw.wmnet with reason: host reimage [13:44:33] (03PS1) 10Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) [13:45:20] (03PS2) 10Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) [13:45:28] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1290743|composer.json: Updated 
symfony/yaml from 7.4.6 to 7.4.12 (T426861)]], [[gerrit:1289347|Skip init.test.js test if VisualEditor not installed (T426740)]], [[gerrit:1289342|fix: simplify to show only one icon type for password reveal (T419413)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes [13:45:28] can now be verified there. [13:45:34] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:45:50] MichaelG_WMF: please test ^^ [13:45:53] I assumed that "wmf.3-CI is green" should be sufficient for me having picked all the relevant dependencies [13:45:58] will test! [13:45:58] (03CR) 10Klausman: [C:03+1] ml-serve(grpc): step 2, add entry to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:46:36] Lucas_WMDE: Looks good! [13:46:56] jmm@cumin2002 drain-node (PID 475255) is awaiting input [13:46:57] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Continuing with deployment [13:46:59] \o/ [13:47:28] (03CR) 10Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [13:47:53] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [13:48:05] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1007.eqiad.wmnet [13:48:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92784 and previous config saved to /var/cache/conftool/dbconfig/20260521-134822-fceratto.json [13:48:50] FIRING: [15x] ProbeDown: Service ganeti1032:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[13:49:09] (03Abandoned) 10Elukey: profile::services_proxy::envoy: lower Wikifunctions' timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1290712 (owner: 10Elukey) [13:51:06] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290743|composer.json: Updated symfony/yaml from 7.4.6 to 7.4.12 (T426861)]], [[gerrit:1289347|Skip init.test.js test if VisualEditor not installed (T426740)]], [[gerrit:1289342|fix: simplify to show only one icon type for password reveal (T419413)]] (duration: 07m 20s) [13:51:12] T426861: symfony/yaml security issues blocking vendor - https://phabricator.wikimedia.org/T426861 [13:51:13] T426740: QUnit test "ext.confirmEdit.hCaptcha.secureEnclave" fails on unrelated WikimediaEvents change - https://phabricator.wikimedia.org/T426740 [13:51:13] T419413: Add reveal password action to mobile account creation form - https://phabricator.wikimedia.org/T419413 [13:51:36] !log UTC afternoon backport+config window done [13:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:10] Thank you :) [13:52:15] (03CR) 10Fabfur: [C:03+1] "LGTM! 
when ready to do a test deployment remember to disable puppet on A:cp except for the impacted hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1290708 (https://phabricator.wikimedia.org/T426323) (owner: 10Clément Goubert) [13:52:24] (03PS3) 10Elukey: role::cache::haproxy: move webrequest ip reputation exp to all magru [puppet] - 10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) [13:53:12] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1007.eqiad.wmnet [13:53:13] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1007.eqiad.wmnet [13:53:19] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1008.eqiad.wmnet [13:53:50] FIRING: [12x] ProbeDown: Service restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:53] np :) [13:54:10] (03CR) 10CDanis: [C:03+1] role::cache::haproxy: move webrequest ip reputation exp to all magru [puppet] - 10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [13:54:40] (03CR) 10Fabfur: [C:03+1] role::cache::haproxy: move webrequest ip reputation exp to all magru [puppet] - 10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [13:55:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:55:19] (03CR) 10Elukey: [C:03+2] role::cache::haproxy: move webrequest ip reputation exp to all magru [puppet] - 
10https://gerrit.wikimedia.org/r/1290767 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [13:58:22] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1008.eqiad.wmnet [13:58:30] (03PS1) 10Majavah: P:cache::haproxy: Remove no-op realm switch [puppet] - 10https://gerrit.wikimedia.org/r/1290780 [13:58:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92786 and previous config saved to /var/cache/conftool/dbconfig/20260521-135830-fceratto.json [13:58:50] RESOLVED: [12x] ProbeDown: Service restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [13:59:02] (03CR) 10CDanis: [C:03+1] P:cache::haproxy: Remove no-op realm switch [puppet] - 10https://gerrit.wikimedia.org/r/1290780 (owner: 10Majavah) [13:59:51] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:00:07] (03CR) 10Elukey: docker_registry: allow multiple docker instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [14:00:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:00:34] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8571/console" [puppet] - 10https://gerrit.wikimedia.org/r/1290780 
(owner: 10Majavah) [14:01:08] (03CR) 10Majavah: [V:03+1 C:03+2] P:cache::haproxy: Remove no-op realm switch [puppet] - 10https://gerrit.wikimedia.org/r/1290780 (owner: 10Majavah) [14:01:16] (03CR) 10Elukey: docker_registry: allow multiple docker instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [14:01:28] (03PS1) 10Krinkle: mmv: Fix missing or stale arrow and counter controls [extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) [14:01:41] (03CR) 10FNegri: [C:03+2] "Merging myself as @rolisaemeka-ctr@wikimedia.org does not have +2 rights." [puppet] - 10https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) (owner: 10Raymond Ndibe) [14:01:55] (03CR) 10FNegri: [C:03+2] "Merging myself as @rolisaemeka-ctr@wikimedia.org does not have +2 rights." [puppet] - 10https://gerrit.wikimedia.org/r/1288521 (https://phabricator.wikimedia.org/T424209) (owner: 10Raymond Ndibe) [14:01:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) (owner: 10Krinkle) [14:02:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2241.codfw.wmnet with OS trixie [14:03:17] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1008.eqiad.wmnet [14:03:18] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1008.eqiad.wmnet [14:03:23] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1009.eqiad.wmnet [14:03:50] FIRING: [12x] ProbeDown: Service restbase2027-a:7000 has failed probes 
(tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:13] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2241: Migration of db2241.codfw.wmnet completed [14:04:37] (CR) JHathaway: [C:+1] "great, thanks for the additional fixes!" [software/spicerack] - https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) (owner: Elukey) [14:05:33] ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945112 (RobH) Apologies, this ran super late and I neglected to update the task accordingly. The mainboard swap was successful but it appears of the two CPUs, one of them has failed. Dell SG is... [14:06:44] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission pc1011.eqiad.wmnet - https://phabricator.wikimedia.org/T426806#11945117 (VRiley-WMF) Open→Resolved [14:06:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet [14:07:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet [14:07:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1053.eqiad.wmnet [14:08:12] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:08:27] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1009.eqiad.wmnet [14:08:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 (T426633)', diff saved to https://phabricator.wikimedia.org/P92788 and previous config saved to /var/cache/conftool/dbconfig/20260521-140837-fceratto.json [14:08:50] RESOLVED: [13x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2039.codfw.wmnet with reason: Maintenance [14:09:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2039 (T426633)', diff saved to https://phabricator.wikimedia.org/P92789 and previous config saved to /var/cache/conftool/dbconfig/20260521-140906-fceratto.json [14:11:10] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11945133 (Marostegui) Broken disk should be blinking now [14:11:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet [14:11:51] (CR) Elukey: [C:+2] redfish: add add_account method for RedfishDell [software/spicerack] - https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) (owner: Elukey) [14:11:56] SRE-swift-storage, Ceph, Infrastructure-Foundations, Machine-Learning-Team, Patch-For-Review: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11945134 (elukey) I had a chat with @JMeybohm the other day, and he pointed out a very wise thing - when w... 
[14:12:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2218: repool after maintenance [14:12:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:13:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1011.eqiad.wmnet [14:13:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:14:31] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1009.eqiad.wmnet [14:14:32] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1009.eqiad.wmnet [14:14:38] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1010.eqiad.wmnet [14:14:55] FIRING: [13x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:26] (PS2) Brouberol: Set JDK21 as default for all kafka-jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/1289959 (https://phabricator.wikimedia.org/T426835) [14:15:29] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1289959 (https://phabricator.wikimedia.org/T426835) (owner: Brouberol) [14:15:59] (CR) CI reject: [V:-1] Set JDK21 as default for all kafka-jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/1289959 (https://phabricator.wikimedia.org/T426835) (owner: Brouberol) [14:16:44] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11945152 (Marostegui) The disk has been replaced but it needs a bit of work to make it part of the array as it seems to have old metadata: ` root@dbproxy2005:~# cat /proc/mdstat Personalities : [... 
[14:16:46] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 167831928 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:16:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet [14:17:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1053.eqiad.wmnet [14:17:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1054.eqiad.wmnet [14:17:35] (PS3) Brouberol: Set JDK21 as default for all kafka-jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/1289959 (https://phabricator.wikimedia.org/T426835) [14:18:02] (PS9) Gkyziridis: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) [14:18:23] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1289959 (https://phabricator.wikimedia.org/T426835) (owner: Brouberol) [14:18:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3474536 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:18:56] ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945156 (ssingh) >>! In T414411#11945112, @RobH wrote: > Apologies, this ran super late and I neglected to update the task accordingly. > > The mainboard swap was successful but it appears of the... 
[14:19:38] (CR) Muehlenhoff: [C:+1] "Looks good, one nit inline, feel free to ignore :-)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester) [14:19:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:19:44] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:19:55] RESOLVED: [13x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet [14:20:34] (CR) Pppery: "No context as to what this is." [puppet] - https://gerrit.wikimedia.org/r/1290097 (owner: Ncmonitor) [14:20:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039 (T426633)', diff saved to https://phabricator.wikimedia.org/P92792 and previous config saved to /var/cache/conftool/dbconfig/20260521-142037-fceratto.json [14:20:56] ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T426902#11945160 (Jhancock.wm) Open→Resolved a:Jhancock.wm [14:20:57] (CR) Brouberol: [C:+2] Set JDK21 as default for all kafka-jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/1289959 (https://phabricator.wikimedia.org/T426835) (owner: Brouberol) [14:21:18] ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T426899#11945163 (Jhancock.wm) Open→Resolved a:Jhancock.wm [14:21:38] (CR) Gkyziridis: rest-gateway: Configure qwen3-14b in rest-gateway (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) 
(owner: Gkyziridis) [14:21:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1011.eqiad.wmnet [14:21:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:21:48] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:24:42] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1010.eqiad.wmnet [14:25:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet [14:25:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1054.eqiad.wmnet [14:26:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:26:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:26:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [14:27:03] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-reboot (exit_code=1) rolling reboot on P{cp601[5-6].drmrs.wmnet} and A:cp [14:29:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [14:29:55] FIRING: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:57] !log elukey@cumin1003 START - Cookbook sre.pki.restart-reboot rolling reboot on A:pki [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1430) [14:30:21] !log elukey@cumin1003 START - Cookbook 
sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [14:30:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. on all recursors [14:30:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039', diff saved to https://phabricator.wikimedia.org/P92793 and previous config saved to /var/cache/conftool/dbconfig/20260521-143045-fceratto.json [14:31:02] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11945184 (MatthewVernon) Open→Resolved This change has been implemented in puppet, so this task can be closed. [14:31:08] ops-drmrs, DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968 (SLyngshede-WMF) NEW [14:31:28] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11945199 (Marostegui) a:Marostegui I've stopped those fake arrays and copied the table partition from sda and added it back to the array: ` root@dbproxy2005:~# sfdisk -d /dev/sda | sfdisk /de... 
[14:32:33] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1010.eqiad.wmnet [14:32:34] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1010.eqiad.wmnet [14:32:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1012.eqiad.wmnet [14:32:39] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1011.eqiad.wmnet [14:33:27] (CR) JMeybohm: [C:+1] docker_registry: allow multiple docker instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: Elukey) [14:34:11] (CR) Elukey: [C:+2] docker_registry: allow multiple docker instances [puppet] - https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: Elukey) [14:34:55] RESOLVED: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [14:35:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [14:36:42] SRE-swift-storage, Data-Persistence, Thumbor: Commons file page should use standard thumb sizes - https://phabricator.wikimedia.org/T426970 (MatthewVernon) NEW [14:37:42] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1011.eqiad.wmnet [14:37:57] !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [14:38:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. 
on all recursors [14:38:47] SRE-swift-storage, Data-Persistence, MediaWiki-File-management, Thumbor: Commons file page should use standard thumb sizes - https://phabricator.wikimedia.org/T426970#11945261 (A_smart_kitten) [14:39:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1012.eqiad.wmnet [14:39:55] FIRING: [13x] ProbeDown: Service ganeti1027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:55] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039', diff saved to https://phabricator.wikimedia.org/P92795 and previous config saved to /var/cache/conftool/dbconfig/20260521-144055-fceratto.json [14:41:17] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:41:28] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:42:08] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1011.eqiad.wmnet [14:42:09] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1011.eqiad.wmnet [14:42:09] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [14:42:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [14:42:54] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes 
rolling reboot on P{ml-serve1001.eqiad.wmnet} and (A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad) [14:42:57] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1001.eqiad.wmnet [14:44:13] FIRING: JobUnavailable: Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:44:55] RESOLVED: [13x] ProbeDown: Service ganeti1027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:46:06] (CR) JHathaway: [C:+1] wikireplicas: Migrate from ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: Ladsgroup) [14:46:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [14:47:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1013.eqiad.wmnet [14:47:18] (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.3.5 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1290803 (https://phabricator.wikimedia.org/T425367) [14:47:46] (PS2) Santiago Faci: Test Kitchen UI: Deploy v1.3.5 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1290803 (https://phabricator.wikimedia.org/T425367) [14:47:51] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:xe-0/2/1 (Core: fasw2-c8a-codfw:xe-0/0/47 {#11519}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:47:56] FIRING: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:xe-0/0/47 (Core: pfw1-codfw:xe-0/2/1 {#11519}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:48:00] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1001.eqiad.wmnet [14:49:42] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2241: Migration of db2241.codfw.wmnet completed [14:49:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:49:45] jouncebot: nowandnext [14:49:45] For the next 0 hour(s) and 10 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1430) [14:49:45] In 0 hour(s) and 10 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1500) [14:50:22] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-build1001.eqiad.wmnet [14:51:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2039 (T426633)', diff saved to https://phabricator.wikimedia.org/P92797 and previous config saved to /var/cache/conftool/dbconfig/20260521-145103-fceratto.json [14:51:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance [14:51:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2040 
(T426633)', diff saved to https://phabricator.wikimedia.org/P92798 and previous config saved to /var/cache/conftool/dbconfig/20260521-145132-fceratto.json [14:52:41] (PS1) Dreamy Jazz: hCaptcha: Enable for DiscussionTools on Group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1290805 (https://phabricator.wikimedia.org/T426039) [14:52:50] (CR) Scott French: [C:+1] "Thanks, Jasmine!" [puppet] - https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) (owner: Jasmine) [14:52:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:xe-0/2/1 (Core: fasw2-c8a-codfw:xe-0/0/47 {#11519}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:53:02] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:xe-0/0/47 (Core: pfw1-codfw:xe-0/2/1 {#11519}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:53:02] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290805 (https://phabricator.wikimedia.org/T426039) (owner: Dreamy Jazz) [14:53:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [14:53:05] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1001.eqiad.wmnet [14:53:06] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1001.eqiad.wmnet [14:53:06] !log klausman@cumin1003 END 
(PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{ml-serve1001.eqiad.wmnet} and (A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad) [14:53:09] (PS4) Jforrester: Provide abstractwiki-rust, using Trixie-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) [14:53:09] (CR) Jforrester: Provide abstractwiki-rust, using Trixie-backports (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester) [14:53:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [14:53:16] (CR) Kosta Harlan: [C:+1] hCaptcha: Enable for DiscussionTools on Group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1290805 (https://phabricator.wikimedia.org/T426039) (owner: Dreamy Jazz) [14:53:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [14:53:25] (CR) Jforrester: Provide abstractwiki-rust, using Trixie-backports (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester) [14:53:38] (PS1) FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [14:53:51] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:xe-0/2/1 (Core: fasw2-c8a-codfw:xe-0/0/47 {#11519}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:53:56] FIRING: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:xe-0/0/47 
(Core: pfw1-codfw:xe-0/2/1 {#11519}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:54:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1013.eqiad.wmnet [14:54:11] (Merged) jenkins-bot: hCaptcha: Enable for DiscussionTools on Group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1290805 (https://phabricator.wikimedia.org/T426039) (owner: Dreamy Jazz) [14:54:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [14:54:27] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1290805|hCaptcha: Enable for DiscussionTools on Group 0 wikis (T426039)]] [14:54:31] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [14:54:55] FIRING: [12x] ProbeDown: Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:59] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-build1001.eqiad.wmnet [14:55:27] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [14:56:13] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1290805|hCaptcha: Enable for DiscussionTools on Group 0 wikis (T426039)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:57:19] !log Disabling puppet on A:cp-text - T426323 [14:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:23] T426323: CX dashboard can't load page collections on some wikis (blocked by CORS) - https://phabricator.wikimedia.org/T426323 [14:57:26] (PS2) FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [14:57:51] (PS2) Jforrester: mmv: Fix missing or stale arrow and counter controls [extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) (owner: Krinkle) [14:57:54] (CR) Clément Goubert: [C:+2] gateway-check: Temp route linkrecommendation to api-gateway [puppet] - https://gerrit.wikimedia.org/r/1290708 (https://phabricator.wikimedia.org/T426323) (owner: Clément Goubert) [14:58:09] jmm@cumin2002 drain-node (PID 523598) is awaiting input [14:58:13] (PS3) FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [14:58:19] (PS1) Elukey: docker_registry: move the /ml prefix to its new S3 backend [puppet] - https://gerrit.wikimedia.org/r/1290808 (https://phabricator.wikimedia.org/T420978) [14:58:38] (CR) Jforrester: "Re-cherry-picked so we get the git hash for blame-storming." 
[extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) (owner: Krinkle) [14:58:39] (CR) Muehlenhoff: "Merging" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester) [14:58:40] (CR) Muehlenhoff: [C:+2] Provide abstractwiki-rust, using Trixie-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester) [14:58:43] (CR) Muehlenhoff: [V:+2 C:+2] Provide abstractwiki-rust, using Trixie-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester) [14:58:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:xe-0/2/1 (Core: fasw2-c8a-codfw:xe-0/0/47 {#11519}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:59:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.pki.restart-reboot (exit_code=0) rolling reboot on A:pki [14:59:13] RESOLVED: JobUnavailable: Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:19] (PS1) Clément Goubert: Revert "gateway-check: Temp route linkrecommendation to api-gateway" [puppet] - https://gerrit.wikimedia.org/r/1290810 [14:59:51] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:ge-0/2/1 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:59:55] RESOLVED: [56x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:05] hashar and andre: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1500) [15:00:12] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [15:00:15] (CR) JHathaway: "Yeah, I didn't really look at the content, I was mechanically renaming. We will shortly have a [profile::mariadb::firewall](https://gerrit" [puppet] - https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway) [15:00:22] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [15:00:31] !log elukey@cumin1003 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master [15:00:55] !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors [15:00:55] FIRING: [47x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors [15:02:11] jmm@cumin2002 drain-node (PID 524139) is awaiting input [15:02:28] PROBLEM - Host lsw1-c1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:02:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2034.codfw.wmnet [15:02:46] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:02:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [15:02:58] PROBLEM - Host scs-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:03:03] (CR) Hnowlan: [C:+2] corto: set default visibility to WMF-NDA [puppet] - https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: Hnowlan) [15:03:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040 (T426633)', diff saved to https://phabricator.wikimedia.org/P92799 and previous config saved to /var/cache/conftool/dbconfig/20260521-150308-fceratto.json [15:03:31] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:04:38] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290805|hCaptcha: Enable for DiscussionTools on Group 0 wikis (T426039)]] (duration: 10m 11s) [15:04:42] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [15:04:48] (CR) Ottomata: stream: webrequest.page_view (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: JavierMonton) [15:04:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:ge-0/2/1 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:04:55] FIRING: [54x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:17] !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors [15:05:20] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors [15:05:41] (03PS1) 10Sbisson: Article Guidance: enable experiment on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) [15:05:55] RESOLVED: [47x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:06:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1014.eqiad.wmnet [15:07:02] (03PS1) 10Daimona Eaytoy: tables-catalog: Rename ce_worklist_articles to ce_invitation_list_articles [puppet] - 10https://gerrit.wikimedia.org/r/1290814 (https://phabricator.wikimedia.org/T426102) [15:07:16] (03CR) 10Ssingh: [C:03+1] Revert "gateway-check: Temp route linkrecommendation to api-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1290810 (owner: 10Clément Goubert) [15:07:18] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.87 ms [15:07:30] RECOVERY - Host lsw1-c1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [15:07:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master [15:08:00] RECOVERY - Host scs-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.61 
ms [15:09:31] (03CR) 10Fabfur: [C:03+1] Revert "gateway-check: Temp route linkrecommendation to api-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1290810 (owner: 10Clément Goubert) [15:09:55] FIRING: [56x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:12] (03CR) 10Clément Goubert: [C:03+2] Revert "gateway-check: Temp route linkrecommendation to api-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1290810 (owner: 10Clément Goubert) [15:10:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [15:10:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [15:10:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2034.codfw.wmnet [15:10:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet [15:11:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet [15:11:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1014.eqiad.wmnet [15:12:50] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:13:12] PROBLEM - Host lsw1-c3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:13:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040', diff saved to https://phabricator.wikimedia.org/P92800 and previous config saved to /var/cache/conftool/dbconfig/20260521-151316-fceratto.json [15:13:31] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - 
https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:13:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:xe-0/0/47 (Core: pfw1-codfw:xe-0/2/1 {#11519}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:14:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:14:29] (03PS1) 10Muehlenhoff: Switch pc2022 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1290817 (https://phabricator.wikimedia.org/T421705) [15:14:55] RESOLVED: [55x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290817 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [15:15:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet [15:15:59] 06SRE, 10Wikimedia-Mailing-lists: New mailing list for the latam tech community - https://phabricator.wikimedia.org/T426803#11945460 (10Ladsgroup) I think the name could be improved. Names are basically impossible to change after creation, so it'd be great if we spent a bit coming up with a more standardized n... 
[15:17:04] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.94 ms [15:17:20] RECOVERY - Host lsw1-c3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.09 ms [15:19:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:19:05] !log Enabling puppet on A:cp-text - T426323 [15:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:10] T426323: CX dashboard can't load page collections on some wikis (blocked by CORS) - https://phabricator.wikimedia.org/T426323 [15:19:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11945476 (10Dzahn) I have not confirmed the key out-of-band yet. [15:20:29] FIRING: [14x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet [15:21:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet [15:22:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11945491 (10Dzahn) @AnnieKim_WMDE Could you please send an email directly from your WMDE account to us (dzah... 
[15:23:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040', diff saved to https://phabricator.wikimedia.org/P92801 and previous config saved to /var/cache/conftool/dbconfig/20260521-152323-fceratto.json [15:24:12] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:24:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:24:41] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:24:42] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:24:44] PROBLEM - Host lsw1-c5-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:25:02] (03CR) 10Hashar: [C:03+1] "Thank you for the backport!" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) (owner: 10Krinkle) [15:25:13] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:25:27] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: fasw2-c8a-codfw:xe-0/0/47 low RX power - https://phabricator.wikimedia.org/T426824#11945503 (10Jhancock.wm) a:03Jhancock.wm got it fixed. fiber patch is good. optic in fasw2-c8a-codfw is good. it was the optic in the pfw sending bad light. cleaning he... 
[15:25:29] RESOLVED: [14x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:41] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:25:46] PROBLEM - Host ps1-c5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:27:52] RECOVERY - Host ps1-c5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.22 ms [15:28:04] RECOVERY - Host lsw1-c5-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms [15:30:44] FIRING: [9x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:55] FIRING: [10x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2040 (T426633)', diff saved to https://phabricator.wikimedia.org/P92802 and previous config saved to /var/cache/conftool/dbconfig/20260521-153331-fceratto.json [15:33:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2048.codfw.wmnet with reason: Maintenance [15:34:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es2048 (T426633)', diff saved to https://phabricator.wikimedia.org/P92803 and previous config saved to /var/cache/conftool/dbconfig/20260521-153400-fceratto.json [15:34:51] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:34:54] !log jgiannelos@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-parsoid: apply [15:34:55] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:34:58] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:35:44] FIRING: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:46] PROBLEM - Host lsw1-c7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:00] PROBLEM - Host ps1-c7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:38:24] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:39:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:40:35] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11945571 (10Marostegui) Progressing nicely: ` root@dbproxy2005:~# cat /proc/mdstat Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sdb2[2] sda2[0] 46842572... 
[15:40:44] RESOLVED: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:41:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es2048 (T426633)', diff saved to https://phabricator.wikimedia.org/P92804 and previous config saved to /var/cache/conftool/dbconfig/20260521-154108-fceratto.json [15:42:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es2048: Repooling [15:42:34] RECOVERY - Host ps1-c7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.24 ms [15:42:38] RECOVERY - Host lsw1-c7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.10 ms [15:43:22] (03Abandoned) 10Kimberly Sarabia: Make image browsing available in Beta and TestWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) (owner: 10Kimberly Sarabia) [15:44:29] (03PS1) 10JavierMonton: stream: webrequest-page-view-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290822 (https://phabricator.wikimedia.org/T412978) [15:45:44] FIRING: [12x] ProbeDown: Service restbase2035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:48:20] PROBLEM - Host ssw1-d1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:23] (03CR) 10Hnowlan: [C:03+1] gateway-check: inference post-migration cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1290019 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [15:49:00] PROBLEM - Host lsw1-d1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:49:14] PROBLEM - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:49:33] 06SRE, 10SRE-Access-Requests, 
06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11945608 (10AnnieKim_WMDE) Done! [15:50:44] RESOLVED: [12x] ProbeDown: Service restbase2035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:44] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [15:53:39] (03CR) 10Btullis: [C:03+1] stream: webrequest-page-view-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290822 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [15:54:04] RECOVERY - Host lsw1-d1-codfw.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 30.75 ms [15:54:05] (03CR) 10JavierMonton: [C:03+2] stream: webrequest-page-view-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290822 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [15:54:08] RECOVERY - Host ps1-d1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.38 ms [15:54:24] RECOVERY - Host ssw1-d1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms [15:56:06] (03Merged) 10jenkins-bot: stream: webrequest-page-view-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290822 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [15:57:04] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:57:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:57:28] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:46] PROBLEM - Host lsw1-c2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 
100% [15:57:46] PROBLEM - Host lsw1-c7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:46] PROBLEM - Host lsw1-b6-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:56] PROBLEM - Host msw2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:02] PROBLEM - Host lsw1-c6-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:06] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:08] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:08] PROBLEM - Host ps1-b6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:08] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:08] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:10] PROBLEM - Host ps1-e1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:10] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:12] PROBLEM - Host ps1-e3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:12] PROBLEM - Host ps1-e4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:12] PROBLEM - Host ps1-f3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:12] PROBLEM - Host ps1-e2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:12] PROBLEM - Host ps1-f1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:12] PROBLEM - Host ps1-f2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:12] PROBLEM - Host ps1-f4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:13] PROBLEM - Host ps1-e5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:14] PROBLEM - Host ssw1-a1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:14] PROBLEM - Host ps1-f5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:14] PROBLEM - Host re0.cr1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:18] PROBLEM - Host ps1-c7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:20] PROBLEM - Host scs-a1-codfw is DOWN: PING 
CRITICAL - Packet loss = 100% [15:58:36] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:58:38] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:58:39] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:58:43] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:58:44] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:05] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:44] FIRING: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:01:06] RECOVERY - Host ps1-c2-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 32.37 ms [16:01:06] RECOVERY - Host lsw1-c2-codfw.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 32.06 ms [16:02:46] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:02:48] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:02:49] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:02:52] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:04:52] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [16:04:54] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.95 ms [16:05:30] PROBLEM - Host lsw1-d5-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:05:32] !log jgiannelos@deploy1003 helmfile [eqiad] START 
helmfile.d/services/mw-parsoid: apply [16:05:35] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:05:36] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:05:39] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:05:44] RESOLVED: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:55] (03PS1) 10Btullis: Create a new role for the dse-k8s nodes tha are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) [16:06:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:57] (03PS2) 10Btullis: Create a new role for the dse-k8s nodes tha are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) [16:07:02] PROBLEM - Host ps1-d5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:07:11] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [16:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:34] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs 
https://wikitech.wikimedia.org/wiki/Orchestrator [16:10:44] FIRING: [7x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:11:48] RECOVERY - Host ps1-d5-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.35 ms [16:11:55] FIRING: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:12:02] RECOVERY - Host lsw1-d5-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.01 ms [16:13:34] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:13:42] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:13:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11945704 (10Dzahn) @SLyngshede-WMF Have you received it? Not sure I have. 
[16:14:14] (03Restored) 10Dzahn: admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: 10Dzahn) [16:14:30] (03PS2) 10Dzahn: admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) [16:14:56] (03CR) 10Dzahn: admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: 10Dzahn) [16:15:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11945707 (10Dzahn) restored the abandoned patched and rebasing it to move forward with... [16:15:39] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:15:41] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:15:42] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:15:46] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:16:44] PROBLEM - Host lsw1-d7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:46] PROBLEM - Host lsw1-d2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:06] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:17:06] PROBLEM - Host ps1-d7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:18:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool pc2 (T421705)', diff saved to https://phabricator.wikimedia.org/P92807 and previous config saved to /var/cache/conftool/dbconfig/20260521-161808-ladsgroup.json [16:18:13] T421705: Move mariadb hosts to nftables - https://phabricator.wikimedia.org/T421705 [16:18:22] !log fceratto@cumin1003 END (PASS) - Cookbook 
sre.mysql.pool (exit_code=0) pool es2048: Repooling [16:18:38] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.48 ms [16:19:02] RECOVERY - Host lsw1-d2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.03 ms [16:20:02] RECOVERY - Host ps1-d7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.90 ms [16:20:08] RECOVERY - Host lsw1-d7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.09 ms [16:20:44] RESOLVED: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:58] PROBLEM - Host lsw1-d2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:23:28] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:24:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on pc2022.codfw.wmnet with reason: Move to nftables [16:24:32] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on pc1022.eqiad.wmnet with reason: Move to nftables [16:24:56] (03CR) 10Ladsgroup: [C:03+2] Switch pc2022 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1290817 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [16:26:01] (03CR) 10Dzahn: "@hashar should I schedule a meeting for us to do this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [16:26:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:27:55] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:32:14] jhancock@cumin2002 netbox (PID 585445) is awaiting input [16:32:55] RESOLVED: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wdqs2028 to codfw - jhancock@cumin2002" [16:33:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wdqs2028 to codfw - jhancock@cumin2002" [16:33:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:34:08] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.04 ms [16:34:16] RECOVERY - Host lsw1-d2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.95 ms [16:34:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2028 [16:35:07] (03PS2) 10CWilliams: sre.mysql.pool: Add support for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) [16:35:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2029 [16:35:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2030 [16:35:39] !log 
jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2031 [16:35:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2030 [16:35:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wdqs2031 [16:35:58] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2031 [16:36:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wdqs2031 [16:37:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2031 [16:37:13] !log fnegri@cumin1003 START - Cookbook sre.mysql.upgrade for clouddb1014.eqiad.wmnet [16:37:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wdqs2031 [16:37:55] FIRING: [8x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:20] (03PS3) 10Btullis: Create a new role for the dse-k8s nodes tha are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) [16:39:06] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [16:40:42] !log rebooting msw-a1-codfw [16:40:44] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:57] RECOVERY - Host ps1-a1-codfw is UP: PING 
OK - Packet loss = 0%, RTA = 31.29 ms [16:42:09] RECOVERY - Host ssw1-a1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms [16:42:55] RESOLVED: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:11] !log rebooting msw-b6-codfw [16:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:50] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [16:44:51] RECOVERY - Host re0.cr1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.94 ms [16:44:57] RECOVERY - Host scs-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.65 ms [16:45:02] !log fnegri@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for clouddb1014.eqiad.wmnet [16:48:02] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet [16:48:16] !log rebooting msw-b7-codfw [16:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:50] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [16:50:03] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [16:50:03] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 32.34 ms [16:50:44] FIRING: [13x] ProbeDown: Service restbase1031-a:9042 has 
failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2099:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:39] !log rebooting msw-c6-codfw [16:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:57] (03CR) 10Scott French: [V:03+2] "Thank you both!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1290054 (owner: 10Scott French) [16:52:00] (03CR) 10Scott French: [V:03+2 C:03+2] httpd*: Align tag with apache2 version and fix -cas Depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1290054 (owner: 10Scott French) [16:52:26] !log rebooting msw-c7-codfw [16:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:05] RECOVERY - Host ps1-c6-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.14 ms [16:54:07] RECOVERY - Host lsw1-c6-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.02 ms [16:54:59] RECOVERY - Host ps1-c7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.31 ms [16:55:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2028 [16:55:21] RECOVERY - Host lsw1-c7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms [16:55:44] RESOLVED: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:52] !log rebooting msw-d3-codfw [16:55:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2028 [16:58:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2029 [16:58:27] (03PS3) 10Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290050 [16:58:27] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2031 [16:58:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2031 [17:00:04] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1700). [17:00:04] swfrench-wmf: gettimeofday() says it's time for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T1700) [17:00:09] o/ [17:00:17] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1290050 (owner: 10Scott French) [17:00:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2029 [17:00:44] FIRING: [14x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:14] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:03:17] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:03:18] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:03:21] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:03:23] (03CR) 10Scott French: [C:03+2] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290050 (owner: 10Scott French) [17:03:55] FIRING: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:05:49] (03Merged) 10jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290050 (owner: 10Scott French) [17:06:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2028.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:06:47] PROBLEM - Host lsw1-c2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:07:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2029.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:07:07] RECOVERY - Host lsw1-c2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.03 ms [17:07:17] !log jhancock@cumin2002 START - 
Cookbook sre.hosts.provision for host wdqs2030.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:07:22] (03PS1) 10Clément Goubert: rest-gateway: Restore strip-cookie behaviour [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290839 [17:07:26] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [17:07:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2031.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:07:59] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:08:00] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:08:13] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:08:14] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:08:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool pc2 (T421705)', diff saved to https://phabricator.wikimedia.org/P92810 and previous config saved to /var/cache/conftool/dbconfig/20260521-170823-ladsgroup.json [17:08:26] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:08:27] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:08:28] T421705: Move mariadb hosts to nftables - https://phabricator.wikimedia.org/T421705 [17:08:42] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:08:43] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:08:46] I'm not going to use my window this week [17:08:55] RESOLVED: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:59] !log 
swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:09:01] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:27] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:10:15] (03CR) 10Majavah: "I would propose we merge this first to unblock my nftables-on-cloud-vps patches, and do the other cleanup after that." [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [17:10:28] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:30] (03CR) 10Majavah: [C:03+1] "(oops, did not mean to click resolve)" [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [17:11:15] (03CR) 10FNegri: sre.mysql.upgrade: support multiinstance hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [17:11:17] (03PS2) 10Clément Goubert: rest-gateway: Restore strip-cookie behaviour [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290839 [17:11:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs2028.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:11:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs2029.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART [17:13:42] !log fnegri@cumin1003 START - Cookbook sre.mysql.upgrade for clouddb1016.eqiad.wmnet [17:14:01] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:14:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2030.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:14:47] PROBLEM - Host lsw1-c2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:14:55] FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2031.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:15:41] (03CR) 10Scott French: [C:03+1] rest-gateway: Restore strip-cookie behaviour [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290839 (owner: 10Clément Goubert) [17:16:04] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Restore strip-cookie behaviour [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290839 (owner: 10Clément Goubert) [17:18:24] (03Merged) 10jenkins-bot: rest-gateway: Restore strip-cookie behaviour [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290839 (owner: 10Clément Goubert) [17:19:33] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.26 ms [17:19:45] RECOVERY - Host lsw1-c2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.05 ms [17:19:55] FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:20:44] RESOLVED: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:20] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:22:53] !log fnegri@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for clouddb1016.eqiad.wmnet [17:23:28] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:24:22] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:25:05] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:25:36] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:26:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:55] FIRING: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2028.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:32:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2082:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:38] (03PS1) 10Clément Goubert: rest-gateway: strip-cookie using generic x_header_to_remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290847 [17:34:55] RESOLVED: 
[12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:58] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:36:31] * swfrench-wmf shakes fist at k8s [17:36:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs2028.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:37:09] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: strip-cookie using generic x_header_to_remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290847 (owner: 10Clément Goubert) [17:37:25] RESOLVED: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2082:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:26] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2028 [17:38:40] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cp6015.drmrs.wmnet with reason: hardware down [17:38:49] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11945889 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7a33507f-3ede-422e-8cb7-03bd812d23c2) set by sukhe@cumin1003 for 3 days, 0:00:00 on 1 host(s) and their services with reason: hardware down ` cp60... 
[17:39:16] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:39:33] (03Merged) 10jenkins-bot: rest-gateway: strip-cookie using generic x_header_to_remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290847 (owner: 10Clément Goubert) [17:39:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2028 [17:40:08] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:40:23] * swfrench-wmf shrugs [17:40:38] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:40:57] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:41:13] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:41:14] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:41:30] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:41:35] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:41:38] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:41:40] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:41:53] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:42:24] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:42:34] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:43:25] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:43:44] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:43:48] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply 
[17:45:44] FIRING: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11945902 (10Jhancock.wm) @elukey @ayounsi (i tagged you both cause i don't know if this is an automation thing or a switch... [17:46:27] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:46:48] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:48:35] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:49:22] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:49:54] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:50:29] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:50:44] RESOLVED: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:47] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945930 (10RobH) [17:51:01] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:51:13] PROBLEM - Host lsw1-d4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:18] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:51:50] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply 
[17:51:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11945945 (10Jhancock.wm) [17:52:09] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:52:11] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945946 (10RobH) [17:52:15] PROBLEM - Host ps1-d4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:52:28] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:52:40] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:52:47] RECOVERY - Host ps1-d4-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.12 ms [17:52:55] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:52:57] RECOVERY - Host lsw1-d4-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.02 ms [17:53:08] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:53:40] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:54:41] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:54:53] PROBLEM - Host lsw1-c2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:55:01] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:55:44] FIRING: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:56:46] (03PS1) 10Andrew Bogott: test_cookbook.py: Allow recording tests on invoked cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/1290858 [17:57:20] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is 
unreachable - https://phabricator.wikimedia.org/T414411#11945980 (10RobH) Without getting into pricing on this public task the options are: * spend more money (see T426985) to replace the CPU ** we have no money left in expendables for this, so it would... [17:58:48] (03CR) 10CI reject: [V:04-1] test_cookbook.py: Allow recording tests on invoked cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/1290858 (owner: 10Andrew Bogott) [18:00:44] RESOLVED: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:01:11] PROBLEM - Host lsw1-c3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:43] RECOVERY - Host lsw1-c3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms [18:01:47] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:01:49] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:02:34] (03PS1) 10Clément Goubert: rest-gateway: Let recommendation-api-ng set CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290865 (https://phabricator.wikimedia.org/T426323) [18:02:59] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.82 ms [18:03:23] (03PS2) 10Andrew Bogott: test_cookbook.py: Allow recording tests on invoked cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/1290858 [18:03:23] (03PS1) 10Andrew Bogott: test_cookbook.py: format with black -l 100 -t py39 [puppet] - 10https://gerrit.wikimedia.org/r/1290867 [18:05:25] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:05:54] (03CR) 10CI reject: [V:04-1] test_cookbook.py: Allow recording tests on invoked cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/1290858 (owner: 10Andrew Bogott) [18:06:55] FIRING: [9x] ProbeDown: Service restbase1038-a:7000 has failed probes 
(tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:54] (03PS3) 10Andrew Bogott: test_cookbook.py: Allow recording tests on invoked cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/1290858 [18:10:44] FIRING: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:11:55] FIRING: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:44] RESOLVED: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:20:27] (03CR) 10Harej: "How soon can this be merged? My rsync is capped at under 5MB/s, which is not fast enough to keep up with new additions." 
[puppet] - 10https://gerrit.wikimedia.org/r/1277254 (owner: 10Harej) [18:20:44] FIRING: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:08] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290803 (https://phabricator.wikimedia.org/T425367) (owner: 10Santiago Faci) [18:25:44] FIRING: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:54] (03PS1) 10Lerickson: Revert "[airflow-wikidata]: Add a connection for the wikidata-platform S3 user" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290871 [18:26:04] (03CR) 10CI reject: [V:04-1] Revert "[airflow-wikidata]: Add a connection for the wikidata-platform S3 user" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290871 (owner: 10Lerickson) [18:26:07] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290803 (https://phabricator.wikimedia.org/T425367) (owner: 10Santiago Faci) [18:26:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11946072 (10Dzahn) Hi @AnnieKim_WMDE for some reason I have not received the mail. Can... 
[18:27:55] RESOLVED: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:28:34] (03CR) 10Ottomata: [C:03+1] "Couple of unresolved comments, but I'm giving a preemptive +1 so you don't have to wait for me after you resolve them." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: 10JavierMonton) [18:33:55] FIRING: [12x] ProbeDown: Service restbase1040-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:00] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11946092 (10ssingh) >>! In T414411#11945980, @RobH wrote: > Without getting into pricing on this public task the options are: > > * spend more money (see T426985) to replace the CPU > ** we have no... 
[18:37:46] (03Abandoned) 10Lerickson: Revert "[airflow-wikidata]: Add a connection for the wikidata-platform S3 user" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290871 (owner: 10Lerickson) [18:38:55] RESOLVED: [12x] ProbeDown: Service restbase1040-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:49] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [18:45:44] FIRING: [9x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:48:55] FIRING: [12x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:50:08] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [18:50:44] FIRING: [12x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:35] topranks: hey we are having some issues on the mgmt network in codfw fyi i am about to reboot msw1-codfw [18:53:08] !log rebooting msw1-codfw [18:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:55] RESOLVED: [12x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:56:07] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:56:11] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:56:13] PROBLEM - Host ps1-c7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:56:15] PROBLEM - Host msw1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:56:51] PROBLEM - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:56:51] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:56:55] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:03] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:11] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:11] PROBLEM - Host ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:11] PROBLEM - Host ps1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:11] PROBLEM - Host ps1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:11] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:11] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:13] PROBLEM - Host ps1-d4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:13] PROBLEM - Host ps1-c5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:13] PROBLEM - Host ps1-c8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:15] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:15] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:17] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:17] PROBLEM - Host ps1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:17] PROBLEM - Host ps1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:17] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:19] PROBLEM - Host ps1-d6-codfw 
is DOWN: PING CRITICAL - Packet loss = 100% [18:57:19] PROBLEM - Host ps1-d7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:19] PROBLEM - Host ps1-d5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:19] PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:19] PROBLEM - Host ps1-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:19] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:57:21] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:58:55] FIRING: [11x] ProbeDown: Service restbase1042-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:59:19] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:59:25] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.34 ms [18:59:25] RECOVERY - Host msw1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [18:59:25] RECOVERY - Host lsw1-b6-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.81 ms [18:59:29] RECOVERY - Host ps1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [18:59:29] RECOVERY - Host ps1-b2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.17 ms [18:59:31] RECOVERY - Host ps1-d8-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.57 ms [18:59:31] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [18:59:33] RECOVERY - Host ps1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.12 ms [18:59:33] RECOVERY - Host ps1-a4-codfw is UP: PING 
OK - Packet loss = 0%, RTA = 34.01 ms [18:59:33] RECOVERY - Host ps1-b6-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.26 ms [18:59:33] RECOVERY - Host ps1-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.38 ms [18:59:33] RECOVERY - Host ps1-c8-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.34 ms [18:59:33] RECOVERY - Host ps1-c5-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [18:59:33] RECOVERY - Host ps1-d7-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.83 ms [18:59:34] RECOVERY - Host ps1-c7-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.14 ms [18:59:34] RECOVERY - Host ps1-d5-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [18:59:35] RECOVERY - Host ps1-c6-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.78 ms [18:59:35] RECOVERY - Host ps1-c4-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.53 ms [18:59:37] PROBLEM - Host cloudsw1-b1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:37] PROBLEM - Host lsw1-a3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:37] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:37] PROBLEM - Host lsw1-b8-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:38] PROBLEM - Host lsw1-c1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:38] PROBLEM - Host lsw1-c2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:39] PROBLEM - Host msw2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:59:39] PROBLEM - Host scs-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:59:40] RECOVERY - Host ps1-d1-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.93 ms [18:59:47] RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.19 ms [18:59:57] RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.52 ms [18:59:57] RECOVERY - Host ps1-a6-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.31 ms [18:59:57] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.16 ms 
[18:59:57] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.31 ms [18:59:57] RECOVERY - Host ps1-d4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.96 ms [18:59:59] RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.04 ms [18:59:59] RECOVERY - Host ps1-b4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.56 ms [18:59:59] (03Abandoned) 10Andrew Bogott: test_cookbook.py: format with black -l 100 -t py39 [puppet] - 10https://gerrit.wikimedia.org/r/1290867 (owner: 10Andrew Bogott) [19:00:03] RECOVERY - Host ps1-d6-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.29 ms [19:00:04] (03CR) 10JHathaway: "@Ladsgroup@gmail.com happy to deploy this one, unless you would rather?" [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [19:00:44] FIRING: [12x] ProbeDown: Service restbase1042-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:01:50] (03PS4) 10Andrew Bogott: test_cookbook.py: Allow recording tests on invoked cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/1290858 [19:02:10] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for caro - https://phabricator.wikimedia.org/T426995 (10thcipriani) 03NEW [19:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:03:55] RESOLVED: [12x] ProbeDown: Service restbase1042-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[19:04:27] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:08:55] FIRING: [14x] ProbeDown: Service restbase1042-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:59] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [19:10:25] (03CR) 10Ladsgroup: [C:03+1] "go for it. Since I will be out in the next two hours :P (jokes aside, if you want to, you can wait until I'm back)" [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [19:10:44] FIRING: [12x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:11:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11946230 (10Jhancock.wm) actually neither one. found the issue and working on it with papaul [19:12:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:55] RESOLVED: [12x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:15:45] (03CR) 10Xcollazo: [V:03+1 C:03+1] "Ah I dropped the ball here, my bad @jamesmhare@gmail.com." 
[puppet] - 10https://gerrit.wikimedia.org/r/1277254 (owner: 10Harej) [19:16:27] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.57 ms [19:16:27] RECOVERY - Host lsw1-c2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [19:16:29] RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.21 ms [19:16:53] RECOVERY - Host lsw1-a3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.89 ms [19:16:57] PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:57] PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-Z on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:57] PROBLEM - ps1-c2-codfw-infeed-load-tower-A-phase-Z on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:17:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:01] PROBLEM - Host lsw1-c2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:18:13] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [19:18:41] RECOVERY - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-X 530 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:41] RECOVERY - ps1-c2-codfw-infeed-load-tower-A-phase-Z on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-A-phase-Z 513 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:41] RECOVERY - ps1-c2-codfw-infeed-load-tower-B-phase-Z on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-Z 576 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:43] RECOVERY - Host cloudsw1-b1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.02 ms [19:18:43] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.78 ms [19:18:49] RECOVERY - Host ps1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [19:20:23] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [19:20:23] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.13 ms [19:20:33] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [19:20:44] FIRING: [18x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:11] RECOVERY - Host ps1-c2-codfw is UP: PING WARNING - Packet loss = 33%, RTA = 34.12 ms [19:21:13] RECOVERY - Host lsw1-c2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [19:21:15] RECOVERY - Host ps1-b8-codfw is UP: PING WARNING - Packet loss = 33%, RTA = 33.45 ms [19:21:23] RECOVERY - Host lsw1-b8-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [19:21:39] PROBLEM - ps1-c2-codfw-infeed-load-tower-A-phase-X on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:22:06] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase [19:23:21] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.54 ms [19:23:43] RECOVERY - Host lsw1-c1-codfw.mgmt 
is UP: PING OK - Packet loss = 0%, RTA = 32.16 ms [19:23:55] RESOLVED: [12x] ProbeDown: Service restbase1044-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:25] RECOVERY - ps1-c2-codfw-infeed-load-tower-A-phase-X on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-A-phase-X 597 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:24:57] PROBLEM - ps1-c2-codfw-infeed-load-tower-A-phase-Z on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:24:57] PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:24:57] PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-Z on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:01] RECOVERY - ps1-c2-codfw-infeed-load-tower-A-phase-Z on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-A-phase-Z 527 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:01] RECOVERY - ps1-c2-codfw-infeed-load-tower-B-phase-Z on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-Z 563 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:01] RECOVERY - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-X 526 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:27:21] RECOVERY - Host scs-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms 
[19:33:51] FIRING: NetworkDeviceAlarmActive: Alarm active on lsw1-c1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-c1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:35:32] (03PS1) 10Clare Ming: Update api_url for growthbook on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290889 [19:40:41] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.31 ms [19:40:41] RECOVERY - Host lsw1-d3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.14 ms [19:40:51] RECOVERY - Host ps1-e3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.65 ms [19:40:51] RECOVERY - Host ps1-e4-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.40 ms [19:40:51] RECOVERY - Host ps1-e5-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.04 ms [19:40:51] PROBLEM - ps1-e3-codfw-infeed-load-tower-A-phase-Y on ps1-e3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:51] PROBLEM - ps1-e3-codfw-infeed-load-tower-A-phase-X on ps1-e3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:53] RECOVERY - Host ps1-f3-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [19:40:53] RECOVERY - Host ps1-e1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [19:40:53] RECOVERY - Host ps1-f1-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.20 ms [19:40:53] RECOVERY - Host ps1-f2-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.45 ms [19:40:55] RECOVERY - Host msw2-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [19:40:55] RECOVERY - Host ps1-f5-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.11 ms [19:40:57] PROBLEM - ps1-e1-codfw-infeed-load-tower-A-phase-Y on 
ps1-e1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:57] PROBLEM - ps1-e1-codfw-infeed-load-tower-A-phase-X on ps1-e1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:57] PROBLEM - ps1-e1-codfw-infeed-load-tower-B-phase-Y on ps1-e1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:57] PROBLEM - ps1-f1-codfw-infeed-load-tower-B-phase-Z on ps1-f1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:57] RECOVERY - ps1-e3-codfw-infeed-load-tower-A-phase-X on ps1-e3-codfw is OK: SNMP OK - ps1-e3-codfw-infeed-load-tower-A-phase-X 151 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:57] RECOVERY - ps1-e3-codfw-infeed-load-tower-A-phase-Y on ps1-e3-codfw is OK: SNMP OK - ps1-e3-codfw-infeed-load-tower-A-phase-Y 179 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:01] RECOVERY - ps1-e1-codfw-infeed-load-tower-A-phase-Y on ps1-e1-codfw is OK: SNMP OK - ps1-e1-codfw-infeed-load-tower-A-phase-Y 295 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:01] RECOVERY - ps1-e1-codfw-infeed-load-tower-B-phase-Y on ps1-e1-codfw is OK: SNMP OK - ps1-e1-codfw-infeed-load-tower-B-phase-Y 229 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:01] RECOVERY - ps1-f1-codfw-infeed-load-tower-B-phase-Z on ps1-f1-codfw is OK: SNMP OK - ps1-f1-codfw-infeed-load-tower-B-phase-Z 297 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:01] RECOVERY - 
ps1-e1-codfw-infeed-load-tower-A-phase-X on ps1-e1-codfw is OK: SNMP OK - ps1-e1-codfw-infeed-load-tower-A-phase-X 270 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:07] PROBLEM - Host lsw1-e2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:41:07] PROBLEM - Host lsw1-f4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:45:57] RECOVERY - Host ps1-e2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.54 ms [19:46:25] RECOVERY - Host lsw1-e2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.23 ms [19:46:44] (03PS2) 10JHathaway: profile::postfix::mx: Mark the SMTP port as intentionally open [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [19:47:57] (03CR) 10JHathaway: profile::postfix::mx: Mark the SMTP port as intentionally open (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [19:49:35] RECOVERY - Host ps1-f4-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms [19:50:42] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [19:50:57] RECOVERY - Host lsw1-f4-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.91 ms [19:57:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T2000). 
[20:00:05] Krinkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:16] o/ [20:02:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:25] RESOLVED: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:39] (03PS2) 10Clare Ming: Remove growthbook config on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290889 [20:12:50] (03CR) 10Santiago Faci: [C:03+2] Remove growthbook config on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290889 (owner: 10Clare Ming) [20:14:47] (03Merged) 10jenkins-bot: Remove growthbook config on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290889 (owner: 10Clare Ming) [20:15:34] (03CR) 10Bking: [C:03+2] Change IP address for Scatter mirror [puppet] - 10https://gerrit.wikimedia.org/r/1277254 (owner: 10Harej) [20:16:01] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [20:26:09] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [20:44:37] (03PS1) 10Lerickson: Remove airflow-wikidata S3 credentials in "connections" and "extra_secrets". 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1290915 (https://phabricator.wikimedia.org/T426764) [20:45:29] (03PS2) 10Lerickson: Remove airflow-wikidata S3 credentials in "connections" and "extra_secrets". [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290915 (https://phabricator.wikimedia.org/T426764) [20:46:33] (03PS1) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [20:48:11] (03PS2) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [20:51:54] (03PS3) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [20:52:41] (03PS4) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [20:58:16] (03PS5) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T2100) [21:02:56] (03PS1) 10Aude: Make logging of title and page ID optional [extensions/QuickSurveys] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290924 (https://phabricator.wikimedia.org/T426457) [21:05:15] (03PS6) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [21:06:10] (03PS1) 10Aude: Re-enable 
ReadingLists QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290926 (https://phabricator.wikimedia.org/T426781) [21:06:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/QuickSurveys] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290924 (https://phabricator.wikimedia.org/T426457) (owner: 10Aude) [21:07:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290926 (https://phabricator.wikimedia.org/T426781) (owner: 10Aude) [21:09:29] (03PS7) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [21:10:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:13:57] (03PS8) 10Santiago Faci: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) [21:14:15] 10ops-eqiad, 06SRE, 06DC-Ops: DCops card swap - https://phabricator.wikimedia.org/T427004 (10VRiley-WMF) 03NEW [21:14:40] 10ops-eqiad, 06SRE, 06DC-Ops: DCops card swap - https://phabricator.wikimedia.org/T427004#11946491 (10VRiley-WMF) 05Open→03Resolved This has been completed [21:15:12] 06SRE, 06Traffic, 13Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11946494 (10VRiley-WMF) [21:16:38] (03CR) 10Clare Ming: [C:03+2] test-kitchen chart: 
Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) (owner: 10Santiago Faci) [21:18:51] (03Merged) 10jenkins-bot: test-kitchen chart: Updated to support growthbook specific configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290916 (https://phabricator.wikimedia.org/T426110) (owner: 10Santiago Faci) [21:19:45] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:20:09] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:21:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11946501 (10VRiley-WMF) a:05BTullis→03VRiley-WMF [21:25:17] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:25:49] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:31:15] (03CR) 10Ladsgroup: [C:03+1] "I'm back. If you want to push it." [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [21:31:51] (03CR) 10JHathaway: [C:03+2] "will do..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [21:38:29] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.3.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290932 [21:41:07] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:50] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.3.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290932 (https://phabricator.wikimedia.org/T397016) [21:42:35] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [21:44:04] (03CR) 10Jdlrobson: [C:03+1] Make logging of title and page ID optional [extensions/QuickSurveys] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290924 (https://phabricator.wikimedia.org/T426457) (owner: 10Aude) [21:44:07] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host zuul2002.codfw.wmnet with OS trixie [21:45:06] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.3.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290932 (https://phabricator.wikimedia.org/T397016) (owner: 10Clare Ming) [21:47:19] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290932 (https://phabricator.wikimedia.org/T397016) (owner: 10Clare Ming) [21:49:19] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [21:49:31] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [22:02:17] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [22:03:21] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul2002.codfw.wmnet with reason: host reimage [22:08:53] !log 
dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2002.codfw.wmnet with reason: host reimage [22:22:36] (03CR) 10Acamicamacaraca: Gender namespaces on Serbo-Croatian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285467 (https://phabricator.wikimedia.org/T425402) (owner: 10Acamicamacaraca) [22:26:38] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul2002.codfw.wmnet with OS trixie [22:34:16] (03Abandoned) 10Ladsgroup: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1277601 (https://phabricator.wikimedia.org/T424551) (owner: 10Gerrit maintenance bot) [22:35:59] (03Abandoned) 10Ladsgroup: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1122608 (https://phabricator.wikimedia.org/T387224) (owner: 10Gerrit maintenance bot) [22:36:12] (03Abandoned) 10Ladsgroup: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1152023 (https://phabricator.wikimedia.org/T395544) (owner: 10Gerrit maintenance bot) [22:36:27] (03Abandoned) 10Ladsgroup: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1169725 (https://phabricator.wikimedia.org/T399619) (owner: 10Gerrit maintenance bot) [22:37:03] (03Abandoned) 10Ladsgroup: wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1189939 (https://phabricator.wikimedia.org/T399891) (owner: 10Gerrit maintenance bot) [22:37:26] (03Abandoned) 10Ladsgroup: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1113105 (https://phabricator.wikimedia.org/T384287) (owner: 10Gerrit maintenance bot) [22:38:36] (03Abandoned) 10Ladsgroup: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1113098 (https://phabricator.wikimedia.org/T384284) (owner: 10Gerrit maintenance bot) [22:38:42] (03Abandoned) 10Ladsgroup: mariadb: Promote db2205 to s3 master [puppet] - 
10https://gerrit.wikimedia.org/r/1113104 (https://phabricator.wikimedia.org/T384287) (owner: 10Gerrit maintenance bot) [22:38:50] (03Abandoned) 10Ladsgroup: mariadb: Promote db1222 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1117181 (https://phabricator.wikimedia.org/T385576) (owner: 10Gerrit maintenance bot) [22:38:56] (03Abandoned) 10Ladsgroup: mariadb: Promote es1039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1152022 (https://phabricator.wikimedia.org/T395544) (owner: 10Gerrit maintenance bot) [22:39:02] (03Abandoned) 10Ladsgroup: mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1152024 (https://phabricator.wikimedia.org/T395545) (owner: 10Gerrit maintenance bot) [22:39:10] (03Abandoned) 10Ladsgroup: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1169723 (https://phabricator.wikimedia.org/T399619) (owner: 10Gerrit maintenance bot) [22:39:15] (03Abandoned) 10Ladsgroup: mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1169812 (https://phabricator.wikimedia.org/T399680) (owner: 10Gerrit maintenance bot) [22:39:21] (03Abandoned) 10Ladsgroup: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1180099 (https://phabricator.wikimedia.org/T402275) (owner: 10Gerrit maintenance bot) [22:42:10] jouncebot: nowandnext [22:42:10] No deployments scheduled for the next 7 hour(s) and 17 minute(s) [22:42:10] In 7 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260522T0600) [22:43:07] (03PS1) 10Dreamy Jazz: Drop wgEnablePartialActionBlocks as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290952 [22:43:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290952 (owner: 10Dreamy Jazz) [22:44:30] (03Merged) 10jenkins-bot: Drop 
wgEnablePartialActionBlocks as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290952 (owner: 10Dreamy Jazz) [22:51:01] (03PS1) 10Dreamy Jazz: Drop not defined config $wgAllowRawHtmlCopyrightMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290954 [22:56:00] (03PS1) 10Dreamy Jazz: Drop undefined wgGENotificationsTrackingEnabled config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290956 [22:56:10] (03PS2) 10Dreamy Jazz: Drop undefined wgGENotificationsTrackingEnabled config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290956 [22:56:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290956 (owner: 10Dreamy Jazz) [22:57:49] (03Merged) 10jenkins-bot: Drop undefined wgGENotificationsTrackingEnabled config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290956 (owner: 10Dreamy Jazz) [23:04:34] (03PS1) 10Dreamy Jazz: Drop $wgGraphShowInToolbar definition as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290957 [23:09:59] (03PS1) 10Dreamy Jazz: Drop wgMFSearchGenerator definition as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290958 [23:13:04] (03PS1) 10Dreamy Jazz: Drop unused wpReportIncidentLocalLinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290960 [23:15:48] (03PS1) 10Dreamy Jazz: Remove unused wgReportIncidentUseV2NonEmergencyFlow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290961 [23:16:45] (03PS2) 10Dreamy Jazz: Remove unused wgReportIncidentUseV2NonEmergencyFlow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290961 [23:16:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290961 (owner: 10Dreamy Jazz) [23:18:14] (03PS2) 10Dreamy Jazz: Drop $wgGraphShowInToolbar definition as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290957 [23:18:22] 
(Merged) jenkins-bot: Remove unused wgReportIncidentUseV2NonEmergencyFlow [mediawiki-config] - https://gerrit.wikimedia.org/r/1290961 (owner: Dreamy Jazz) [23:19:06] (PS2) Dreamy Jazz: Drop wgMFSearchGenerator definition as unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1290958 [23:19:11] (PS2) Dreamy Jazz: Drop unused wpReportIncidentLocalLinks [mediawiki-config] - https://gerrit.wikimedia.org/r/1290960 [23:19:17] (PS2) Dreamy Jazz: Drop not defined config $wgAllowRawHtmlCopyrightMessages [mediawiki-config] - https://gerrit.wikimedia.org/r/1290954 [23:19:17] (PS3) Dreamy Jazz: Drop $wgGraphShowInToolbar definition as unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1290957 [23:19:17] (PS3) Dreamy Jazz: Drop wgMFSearchGenerator definition as unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1290958 [23:19:17] (PS3) Dreamy Jazz: Drop unused wpReportIncidentLocalLinks [mediawiki-config] - https://gerrit.wikimedia.org/r/1290960 [23:19:39] (CR) CI reject: [V:-1] Drop wgMFSearchGenerator definition as unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1290958 (owner: Dreamy Jazz) [23:19:40] (CR) CI reject: [V:-1] Drop unused wpReportIncidentLocalLinks [mediawiki-config] - https://gerrit.wikimedia.org/r/1290960 (owner: Dreamy Jazz) [23:26:23] SRE, Wikimedia-Mailing-lists: New mailing list for the latam tech community - https://phabricator.wikimedia.org/T426803#11946963 (Arcstur) In our telegram group we landed on wikitec-latam and we were happy with it (no new discussions on it as of now). As per the standardization, it would fall under the l... 
[23:29:52] (PS1) Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/1290964 [23:31:40] (PS2) Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/1290964 [23:33:04] (PS3) Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/1290964 [23:33:51] FIRING: NetworkDeviceAlarmActive: Alarm active on lsw1-c1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-c1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:34:48] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290954 (owner: Dreamy Jazz) [23:34:49] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290957 (owner: Dreamy Jazz) [23:34:49] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290958 (owner: Dreamy Jazz) [23:34:50] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290960 (owner: Dreamy Jazz) [23:35:55] (Merged) jenkins-bot: Drop not defined config $wgAllowRawHtmlCopyrightMessages [mediawiki-config] - https://gerrit.wikimedia.org/r/1290954 (owner: Dreamy Jazz) [23:35:58] (Merged) jenkins-bot: Drop $wgGraphShowInToolbar definition as unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1290957 (owner: Dreamy Jazz) [23:36:03] (Merged) jenkins-bot: Drop wgMFSearchGenerator definition as 
unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1290958 (owner: Dreamy Jazz) [23:36:07] (Merged) jenkins-bot: Drop unused wpReportIncidentLocalLinks [mediawiki-config] - https://gerrit.wikimedia.org/r/1290960 (owner: Dreamy Jazz) [23:36:25] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1290954|Drop not defined config $wgAllowRawHtmlCopyrightMessages]], [[gerrit:1290957|Drop $wgGraphShowInToolbar definition as unused]], [[gerrit:1290958|Drop wgMFSearchGenerator definition as unused]], [[gerrit:1290960|Drop unused wpReportIncidentLocalLinks]] [23:38:08] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1290954|Drop not defined config $wgAllowRawHtmlCopyrightMessages]], [[gerrit:1290957|Drop $wgGraphShowInToolbar definition as unused]], [[gerrit:1290958|Drop wgMFSearchGenerator definition as unused]], [[gerrit:1290960|Drop unused wpReportIncidentLocalLinks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified [23:38:08] there. 
[23:38:57] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [23:39:03] jouncebot: nowandnext [23:39:03] No deployments scheduled for the next 6 hour(s) and 20 minute(s) [23:39:04] In 6 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260522T0600) [23:40:11] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1290973 [23:40:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1290973 (owner: TrainBranchBot) [23:43:08] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290954|Drop not defined config $wgAllowRawHtmlCopyrightMessages]], [[gerrit:1290957|Drop $wgGraphShowInToolbar definition as unused]], [[gerrit:1290958|Drop wgMFSearchGenerator definition as unused]], [[gerrit:1290960|Drop unused wpReportIncidentLocalLinks]] (duration: 06m 42s) [23:52:28] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1290973 (owner: TrainBranchBot)