[00:17:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:43:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:54:03] RECOVERY - dump of db_inventory in codfw on backupmon1001 is OK: Last dump for db_inventory at codfw (db2185) taken on 2026-05-26 00:39:35 (1.3 MiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:59:03] RECOVERY - dump of db_inventory in eqiad on backupmon1001 is OK: Last dump for db_inventory at eqiad (db1215) taken on 2026-05-26 00:40:30 (1.3 MiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:08:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.47.0-wmf.4 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293185 (https://phabricator.wikimedia.org/T423913) [01:09:00] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.47.0-wmf.4 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293185 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [01:09:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1293186 [01:09:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 
10https://gerrit.wikimedia.org/r/1293186 (owner: 10TrainBranchBot) [01:23:38] (03Merged) 10jenkins-bot: Branch commit for wmf/1.47.0-wmf.4 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293185 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [01:23:45] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1293186 (owner: 10TrainBranchBot) [01:29:57] FIRING: ProbeDown: Service upload:80 has failed probes (http_upload_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:57] RESOLVED: ProbeDown: Service upload:80 has failed probes (http_upload_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:45:57] FIRING: ProbeDown: Service upload:80 has failed probes (http_upload_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:46:43] o/ [01:46:52] !incidents [01:46:53] 8016 (UNACKED) ProbeDown sre (2620:0:861:ed1a::2:b ip6 upload:80 probes/service http_upload_ip6 eqiad) [01:46:53] 8015 (RESOLVED) Icinga meta-monitoring check is still DOWN [01:47:01] !ack 8016 [01:47:01] 8016 (ACKED) ProbeDown sre (2620:0:861:ed1a::2:b ip6 upload:80 probes/service http_upload_ip6 eqiad) [01:48:17] seems to be recovering, after a similar blip around 1:28 [01:50:57] RESOLVED: ProbeDown: Service upload:80 has failed probes (http_upload_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T0200) [02:01:03] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:23] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 20s) [02:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:16:11] FIRING: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [02:21:11] RESOLVED: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [02:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T0300) [03:01:54] (03PS1) 10TrainBranchBot: testwikis to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293188 (https://phabricator.wikimedia.org/T423913) [03:01:57] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293188 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [03:02:51] (03Merged) 10jenkins-bot: testwikis to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293188 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [03:03:19] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.47.0-wmf.4 refs T423913 [03:03:24] T423913: 1.47.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T423913 [03:04:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:25:47] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246 (10Papaul) 03NEW [03:39:43] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.47.0-wmf.4 refs T423913 (duration: 36m 24s) [03:39:48] T423913: 1.47.0-wmf.4 deployment blockers - 
https://phabricator.wikimedia.org/T423913 [03:40:47] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [03:43:15] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [03:44:51] FIRING: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ... [03:44:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [04:02:40] !log mwpresync@deploy1003 Pruned MediaWiki: 1.47.0-wmf.1 (duration: 02m 32s) [04:13:53] FIRING: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [04:14:25] !ack [04:14:26] 8017 (ACKED) DDoSDetected sre (netflow5003:9100 eqsin) [04:14:39] Hm let me start my computer [04:14:53] Yeh going to mine too [04:17:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:28:53] RESOLVED: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [04:34:16] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11953658 (10Papaul) [04:34:30] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [04:37:11] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [04:47:41] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11953661 (10Papaul) [04:49:42] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11953662 (10Papaul) 05Open→03Resolved Both switches are now set to offline. The only step left is for onsite to remove all the cable... [04:51:51] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11953665 (10Papaul) Email back from Nokia team ` The target release is still being considered. I’ll let you know once we have more information. ` [04:59:51] RESOLVED: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ... 
[04:59:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [05:12:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2024.codfw.wmnet,pc[1014,1024].eqiad.wmnet with reason: Maintenance on pc4 [05:15:38] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1014.eqiad.wmnet: Maintenance on pc4 [05:15:40] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:15:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:15:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1014.eqiad.wmnet: Maintenance on pc4 [05:19:06] (03PS1) 10Marostegui: pc1024: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1293191 (https://phabricator.wikimedia.org/T418973) [05:19:38] (03CR) 10CI reject: [V:04-1] pc1024: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1293191 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [05:19:55] (03PS2) 10Marostegui: pc1024: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1293191 (https://phabricator.wikimedia.org/T418973) [05:21:21] (03CR) 10Marostegui: [C:03+2] pc1024: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1293191 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [05:47:44] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T0600). [06:00:50] federico3: let's run the RO cookbook [06:00:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.global-read-only [06:01:07] marostegui: s2 is RO [06:01:10] checking [06:01:14] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.global-read-only (exit_code=0) [06:01:24] federico3: yep, it works [06:01:32] I see it in logstash [06:01:48] Row reads seem to be dropping in grafana [06:01:51] federico3: for later, the script should do a !log setting $section in RO (or setting all sections in RO) [06:02:04] ack [06:02:15] federico3: it is definitely RO, yep, so it works [06:02:17] you can disable it [06:02:25] going RW now [06:02:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.global-read-only [06:02:59] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.global-read-only (exit_code=0) [06:03:09] it's RW now [06:03:13] I can write yes [06:03:19] So it works, let's add the !log thing [06:03:29] federico3: you can proceed with the s2 normal switchover [06:03:32] ok I'll continue the run doing the s2 switchover [06:04:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T425622 [06:04:25] T425622: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T425622 [06:04:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db1162 with weight 0 T425622', diff saved to https://phabricator.wikimedia.org/P92907 and previous config saved to /var/cache/conftool/dbconfig/20260526-060443-fceratto.json [06:08:34] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db1162 to s2 master [puppet] - 
10https://gerrit.wikimedia.org/r/1284559 (https://phabricator.wikimedia.org/T425622) (owner: 10Gerrit maintenance bot) [06:10:03] !log Starting s2 eqiad failover from db1222 to db1162 - T425622 [06:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:07] T425622: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T425622 [06:10:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T425622', diff saved to https://phabricator.wikimedia.org/P92908 and previous config saved to /var/cache/conftool/dbconfig/20260526-061021-fceratto.json [06:10:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:11:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db1162 to s2 primary and set section read-write T425622', diff saved to https://phabricator.wikimedia.org/P92909 and previous config saved to /var/cache/conftool/dbconfig/20260526-061114-fceratto.json [06:12:55] (03CR) 10Federico Ceratto: [C:03+2] wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1284560 (https://phabricator.wikimedia.org/T425622) (owner: 10Gerrit maintenance bot) [06:14:16] !log fceratto@dns1005 START - running authdns-update [06:15:58] !log fceratto@dns1005 END - running authdns-update [06:16:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db1222 T425622', diff saved to https://phabricator.wikimedia.org/P92910 and previous config saved to /var/cache/conftool/dbconfig/20260526-061656-fceratto.json [06:17:01] T425622: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T425622 [06:17:23] (03PS1) 10Marostegui: installserver: Remove pc1021 from /srv formatting [puppet] - 10https://gerrit.wikimedia.org/r/1293583 [06:19:20] 
PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:20:31] (03CR) 10Hashar: "@abran@wikimedia.org it is a quite old change and was buried at the bottom of my Gerrit dashboard ;) May you Puppet merge this one please," [puppet] - 10https://gerrit.wikimedia.org/r/1193832 (owner: 10Hashar) [06:23:20] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1222: Switchover [06:24:55] (03CR) 10Marostegui: [C:03+2] installserver: Remove pc1021 from /srv formatting [puppet] - 10https://gerrit.wikimedia.org/r/1293583 (owner: 10Marostegui) [06:26:20] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [06:28:53] !log fceratto@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [06:29:20] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:31:20] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [06:31:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [06:31:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T426633)', diff saved to https://phabricator.wikimedia.org/P92912 and previous config saved to /var/cache/conftool/dbconfig/20260526-063155-fceratto.json [06:34:20] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:35:36] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6003.wikimedia.org [06:36:20] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [06:36:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:38:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T426633)', diff saved to https://phabricator.wikimedia.org/P92914 and previous config saved to /var/cache/conftool/dbconfig/20260526-063853-fceratto.json [06:39:20] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:41:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6003.wikimedia.org [06:47:52] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1222: Switchover [06:48:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [06:48:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1222 (T419635)', diff saved to https://phabricator.wikimedia.org/P92916 and previous config saved to /var/cache/conftool/dbconfig/20260526-064833-fceratto.json [06:48:38] T419635: Drop il_to column from imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T419635 [06:49:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P92917 and previous config saved to /var/cache/conftool/dbconfig/20260526-064901-fceratto.json [06:50:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org [06:53:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1045.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [06:53:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:53:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1045.eqiad.wmnet [06:53:42] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1046.eqiad.wmnet [06:55:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [06:57:20] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [06:58:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet [06:58:42] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [06:59:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org [06:59:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P92918 and previous config saved to /var/cache/conftool/dbconfig/20260526-065909-fceratto.json [07:00:05] Amir1, urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T0700). 
[07:00:05] codders: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] o/ [07:00:46] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [07:01:06] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [07:01:12] (03CR) 10Marostegui: "As discussed during today's test, we have to make sure we do a !log on wikimedia-operations for when we go RO on all sections or on a sect" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [07:02:35] FIRING: DiskSpace: Disk space krb1002:9100:/ 1.639% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:02:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org [07:03:53] hmm. 
quiet in here [07:03:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet [07:04:04] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1046.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [07:04:23] PROBLEM - MariaDB read only s2 #page on db2204 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.11.16-MariaDB-log, Uptime 2311610s, event_scheduler: True, 184.67 QPS, connection latency: 0.020522s, query latency: 0.000833s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [07:04:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet [07:04:34] federico3: ^ [07:04:35] !ack [07:04:36] 8018 (ACKED) db2204 (paged)/MariaDB read only s2 (paged) [07:04:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T419635)', diff saved to https://phabricator.wikimedia.org/P92919 and previous config saved to /var/cache/conftool/dbconfig/20260526-070436-fceratto.json [07:04:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:04:59] anyone around to keep half an eye while I deploy a patch? [07:05:14] federico3: why the old master has slaves? [07:05:34] sorry, I checked codfw [07:05:36] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 8.70 ms [07:05:48] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [07:05:50] federico3: We left the codfw master with RW [07:05:56] federico3: the script needs fixing [07:06:10] I just fixed it [07:06:14] RW on dbctl or on mariadb? 
[07:06:23] RECOVERY - MariaDB read only s2 #page on db2204 is OK: Version 10.11.16-MariaDB-log, Uptime 2311730s, read_only: True, event_scheduler: True, 120.88 QPS, connection latency: 0.025349s, query latency: 0.001037s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [07:06:27] afaik it's meant to be RW on dbctl [07:06:28] federico3: mariadb [07:06:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org [07:06:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1293147 (https://phabricator.wikimedia.org/T385798) (owner: 10Hnowlan) [07:06:45] thank you for fixing the issue so quickly [07:06:50] jelto: sorry for the page [07:06:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1046.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [07:06:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:07:00] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1046.eqiad.wmnet [07:07:13] federico3: the script should never change the intermediate master to RW [07:07:34] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 1.321% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:07:48] (03CR) 10Marostegui: "We just got a p4ge for the codfw master being RW. The script should never change the intermediate master mariadb config." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [07:09:09] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1047.eqiad.wmnet [07:09:10] !log failover Ganeti master in eqiad to ganeti1048 [07:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T426633)', diff saved to https://phabricator.wikimedia.org/P92920 and previous config saved to /var/cache/conftool/dbconfig/20260526-070916-fceratto.json [07:09:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [07:09:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2180 (T426633)', diff saved to https://phabricator.wikimedia.org/P92921 and previous config saved to /var/cache/conftool/dbconfig/20260526-070946-fceratto.json [07:09:52] Hi, does anyone know why PatchDemo isn't working? [07:09:55] or where should I report this? [07:09:57] "The Catalyst API backend is unreachable, demo creation disabled" [07:10:04] https://patchdemo.wmcloud.org/ [07:10:32] urbanecm: are you around? 
[07:10:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [07:10:56] PROBLEM - ganeti-wconfd running on ganeti1046 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [07:11:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11953777 (10ops-monitoring-bot) Draining ganeti1025.eqiad.wmnet of running VMs [07:11:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [07:13:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [07:13:49] (03PS1) 10Brouberol: Revert "idp/idp_test: temporarily rollback growthbook(-next) access to nda/wmf" [puppet] - 10https://gerrit.wikimedia.org/r/1293585 [07:14:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P92922 and previous config saved to /var/cache/conftool/dbconfig/20260526-071444-fceratto.json [07:15:42] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1026.eqiad.wmnet [07:15:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [07:16:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T426633)', diff saved to https://phabricator.wikimedia.org/P92923 and previous config saved to /var/cache/conftool/dbconfig/20260526-071635-fceratto.json [07:17:17] Neriah: probably a Phabricator task is best [07:17:49] codders: you're probably best asking in -sre or -releng if you're looking for a deployer as this channel is very noisy and it will be easily missed [07:18:02] will do - thanks! 
[07:18:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet [07:18:52] RhinosF1: I did it (T427248). Should I tag it somehow? [07:18:52] T427248: PatchDemo: Catalyst API backend unreachable - https://phabricator.wikimedia.org/T427248 [07:18:52] thanks :) [07:18:59] Looking [07:19:27] Neriah: looks good to me [07:23:50] starting with the deploy of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1291951 [07:23:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet [07:23:51] (03CR) 10Hashar: [C:03+1] "Looks good! You can proceed :-) Than you for the notification!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291951 (https://phabricator.wikimedia.org/T424329) (owner: 10Arthur taylor) [07:24:08] codders: +1 ed [07:24:16] you can self deploy can you? [07:24:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet [07:24:21] hashar: yup [07:24:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arthurtaylor@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291951 (https://phabricator.wikimedia.org/T424329) (owner: 10Arthur taylor) [07:24:38] <3 [07:24:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P92924 and previous config saved to /var/cache/conftool/dbconfig/20260526-072452-fceratto.json [07:25:39] (03Merged) 10jenkins-bot: Enable and configure WikiProjects prototype on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291951 (https://phabricator.wikimedia.org/T424329) (owner: 10Arthur taylor) [07:25:44] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [07:26:45] !log arthurtaylor@deploy1003 Started scap sync-world: Backport for [[gerrit:1291951|Enable and configure WikiProjects prototype on Test Wikidata (T424329)]] 
[07:26:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet [07:26:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260526-072643-fceratto.json [07:26:52] T424329: [WIPR] Prototype - Display Wikiproject link on Test Wikidata Item pages using properties - https://phabricator.wikimedia.org/T424329 [07:30:26] (03PS1) 10Muehlenhoff: Failover irc.w.o to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1293586 [07:30:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet [07:31:03] !log arthurtaylor@deploy1003 arthurtaylor: Backport for [[gerrit:1291951|Enable and configure WikiProjects prototype on Test Wikidata (T424329)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:31:24] jiji@cumin1003 decommission (PID 468555) is awaiting input [07:31:24] testing the patch... 
[07:32:15] looks good - proceeding [07:32:32] !log arthurtaylor@deploy1003 arthurtaylor: Continuing with deployment [07:35:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T419635)', diff saved to https://phabricator.wikimedia.org/P92926 and previous config saved to /var/cache/conftool/dbconfig/20260526-073459-fceratto.json [07:35:02] !log start rebooting magru liberica instances (T426563) [07:35:05] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:36:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:36:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1222: Upgrading db1222.eqiad.wmnet [07:36:57] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003.magru.wmnet} and A:liberica [07:37:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1222: Upgrading db1222.eqiad.wmnet [07:37:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P92928 and previous config saved to /var/cache/conftool/dbconfig/20260526-073702-fceratto.json [07:38:24] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1047.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [07:38:30] (03CR) 10Mszwarc: Enforce 2FA requirements for phase 3 groups (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [07:38:46] !log arthurtaylor@deploy1003 Finished scap 
sync-world: Backport for [[gerrit:1291951|Enable and configure WikiProjects prototype on Test Wikidata (T424329)]] (duration: 12m 01s) [07:38:50] T424329: [WIPR] Prototype - Display Wikiproject link on Test Wikidata Item pages using properties - https://phabricator.wikimedia.org/T424329 [07:39:02] all done. Thanks hashar , RhinosF1 ! [07:39:27] (03PS8) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [07:40:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1046.eqiad.wmnet [07:40:25] (03CR) 10Federico Ceratto: "I sent an update to only set the read_only flag in the primary DC" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [07:40:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1046.eqiad.wmnet [07:40:35] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003.magru.wmnet} and A:liberica [07:41:16] marostegui@cumin1003 major-upgrade (PID 489785) is awaiting input [07:41:28] jiji@cumin1003 decommission (PID 468555) is awaiting input [07:41:39] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7001.magru.wmnet} and A:liberica [07:42:40] (03PS1) 10Bartosz Wójtowicz: ml-services: Add gRPC deployment for outlink topic model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1293591 (https://phabricator.wikimedia.org/T418493) [07:43:46] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [07:43:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:43:54] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [07:43:56] (03PS1) 10Slyngshede: data.yaml Off-boarding for Eli Asikin-Garmager [puppet] - 10https://gerrit.wikimedia.org/r/1293592 [07:44:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [07:44:36] (03CR) 10CI reject: [V:04-1] data.yaml Off-boarding for Eli Asikin-Garmager [puppet] - 10https://gerrit.wikimedia.org/r/1293592 (owner: 10Slyngshede) [07:44:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11953837 (10ops-monitoring-bot) Draining ganeti1025.eqiad.wmnet of running VMs [07:45:25] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7001.magru.wmnet} and A:liberica [07:45:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:45:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1222: Upgrading db1222.eqiad.wmnet [07:46:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1222: Upgrading db1222.eqiad.wmnet [07:46:16] (03PS1) 10Elukey: redifish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [07:47:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T426633)', diff saved to https://phabricator.wikimedia.org/P92929 and previous config saved to /var/cache/conftool/dbconfig/20260526-074710-fceratto.json [07:47:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [07:47:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T426633)', diff saved to https://phabricator.wikimedia.org/P92930 and previous config saved to /var/cache/conftool/dbconfig/20260526-074739-fceratto.json [07:47:44] (03PS1) 10Mszwarc: Allow to remove passkeys when there's only one standard 2FA method [extensions/OATHAuth] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293594 (https://phabricator.wikimedia.org/T426872) [07:48:32] (03CR) 10Dpogorzelski: [C:03+1] ml-services: Add gRPC deployment for outlink topic model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293591 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [07:48:55] (03PS2) 10Slyngshede: data.yaml Off-boarding for Eli Asikin-Garmager [puppet] - 10https://gerrit.wikimedia.org/r/1293592 [07:49:00] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1023.eqiad.wmnet [07:49:51] marostegui@cumin1003 major-upgrade (PID 496080) is awaiting input [07:49:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11953851 (10AnnieKim_WMDE) Thanks for your help. Sorry about the email confusion -... [07:50:12] codders: congrats, and thank you RhinosF1 [07:51:10] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828#11953855 (10Fabfur) Thanks! 
[07:51:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1047.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [07:51:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:51:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1047.eqiad.wmnet [07:52:07] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7002.magru.wmnet} and A:liberica [07:52:45] (03PS1) 10Muehlenhoff: profile::rpkivalidator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1293609 [07:54:01] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Add gRPC deployment for outlink topic model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293591 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [07:54:24] jiji@cumin1003 decommission (PID 501948) is awaiting input [07:54:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T426633)', diff saved to https://phabricator.wikimedia.org/P92931 and previous config saved to /var/cache/conftool/dbconfig/20260526-075435-fceratto.json [07:54:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:56:02] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7002.magru.wmnet} and A:liberica [07:56:26] (03Merged) 10jenkins-bot: ml-services: Add gRPC deployment for outlink topic model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1293591 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [07:56:42] !log start rebooting drmrs liberica instances (T426563) [07:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:59] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003.drmrs.wmnet} and A:liberica [07:57:35] FIRING: DiskSpace: Disk space krb1002:9100:/ 2.084% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:58:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:59:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [07:59:06] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [07:59:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:59:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:59:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1023.eqiad.wmnet [07:59:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11953873 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1023.eqiad.wmnet` - ganeti1023.eqiad.wmnet (**PASS**) - Downt... 
[07:59:32] PROBLEM - MariaDB Replica Lag: s2 #page on db1222 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 997.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [07:59:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:59:42] !ack [07:59:43] 8019 (ACKED) db1222 (paged)/MariaDB Replica Lag: s2 (paged) [07:59:46] Downtime expired [07:59:55] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1024.eqiad.wmnet [07:59:55] okay [07:59:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1222: Upgrading db1222.eqiad.wmnet [08:00:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1222: Upgrading db1222.eqiad.wmnet [08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T0800) [08:00:22] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs6003.drmrs.wmnet} and A:liberica [08:00:37] RECOVERY - MariaDB Replica Lag: s2 #page on db1222 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [08:00:47] hi, train will rollout in a few minutes [08:01:12] (03PS1) 10Marostegui: db1222: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293646 (https://phabricator.wikimedia.org/T424615) [08:01:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1222.eqiad.wmnet with OS trixie [08:04:04] (03CR) 10Marostegui: [C:03+2] db1222: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293646 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [08:04:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P92932 and previous config saved to 
/var/cache/conftool/dbconfig/20260526-080443-fceratto.json [08:04:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:05:29] (03PS1) 10TrainBranchBot: group0 to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293647 (https://phabricator.wikimedia.org/T423913) [08:05:32] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293647 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [08:06:08] (03PS1) 10KartikMistry: Update Recommendation API to 2026-05-26-074931-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293648 [08:06:50] (03Merged) 10jenkins-bot: group0 to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293647 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [08:07:22] jnuche: fine to deploy patch in the ml-service? [08:07:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:35] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:07:41] kart_: please hang on, train is running [08:07:48] OK [08:08:21] (03PS9) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [08:09:07] (03CR) 10Federico Ceratto: "I added the logging into SAL. The logging for Dbctl is optional, the one for MariaDB is always written." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [08:09:21] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:09:23] (03PS1) 10Cathal Mooney: Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) [08:10:19] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6001.drmrs.wmnet} and A:liberica [08:12:26] jmm@cumin2002 decommission (PID 599297) is awaiting input [08:12:47] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.4 refs T423913 [08:12:52] T423913: 1.47.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T423913 [08:13:25] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs6001.drmrs.wmnet} and A:liberica [08:14:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P92934 and previous config saved to /var/cache/conftool/dbconfig/20260526-081451-fceratto.json [08:16:33] (03PS2) 10Cathal Mooney: Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) [08:16:45] kart_: train done, feel free to deploy [08:17:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:10] 10SRE-swift-storage, 10Maps: 
Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184#11953930 (10jijiki) [08:18:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage [08:19:34] 06SRE, 06serviceops-deprecated: Refactor memcached modules - https://phabricator.wikimedia.org/T284454#11953940 (10jijiki) 05Open→03Invalid bluntly closing this [08:19:52] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [08:20:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2195: Upgrading db2195.codfw.wmnet [08:21:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2195: Upgrading db2195.codfw.wmnet [08:23:03] (03CR) 10Aklapper: [V:03+2 C:03+2] Log AVA account disabling in the user account management feed [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1293101 (https://phabricator.wikimedia.org/T426972) (owner: 10Aklapper) [08:23:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage [08:23:44] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2195.codfw.wmnet with OS trixie [08:24:39] (03PS8) 10Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) [08:24:43] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [08:24:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T426633)', diff saved to https://phabricator.wikimedia.org/P92936 and previous config saved to /var/cache/conftool/dbconfig/20260526-082458-fceratto.json [08:25:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2217.codfw.wmnet with reason: Maintenance [08:25:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2217 (T426633)', diff saved to https://phabricator.wikimedia.org/P92937 and previous config saved to /var/cache/conftool/dbconfig/20260526-082531-fceratto.json [08:30:34] jouncebot: nowandnext [08:30:34] For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T0800) [08:30:34] In 1 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1000) [08:30:37] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6002.drmrs.wmnet} and A:liberica [08:31:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292032 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [08:31:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 (owner: 10Dreamy Jazz) [08:31:35] jnuche: thanks [08:32:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:32:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T426633)', diff saved to https://phabricator.wikimedia.org/P92938 and previous config saved to /var/cache/conftool/dbconfig/20260526-083233-fceratto.json [08:32:51] (03Merged) 10jenkins-bot: Grant globalblock-local-status to groups with globalblock-whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292032 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [08:32:54] (03Merged) 10jenkins-bot: 
hCaptcha CommonSettings.php: Don't define sitekeys as config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 (owner: 10Dreamy Jazz) [08:33:22] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1292032|Grant globalblock-local-status to groups with globalblock-whitelist (T277942)]], [[gerrit:1290964|hCaptcha CommonSettings.php: Don't define sitekeys as config vars]] [08:33:26] T277942: Address Voice and Tone issues in GlobalBlocking - https://phabricator.wikimedia.org/T277942 [08:33:36] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2026-05-26-074931-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293648 (owner: 10KartikMistry) [08:33:42] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs6002.drmrs.wmnet} and A:liberica [08:34:53] I'll also have one thing to backport [08:35:11] I'll ping you when I'm done [08:35:20] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1292032|Grant globalblock-local-status to groups with globalblock-whitelist (T277942)]], [[gerrit:1290964|hCaptcha CommonSettings.php: Don't define sitekeys as config vars]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:35:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:35:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:35:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1024.eqiad.wmnet [08:35:25] ack [08:35:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11953997 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1024.eqiad.wmnet` - ganeti1024.eqiad.wmnet (**PASS**) - Downt... [08:35:34] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5006.eqsin.wmnet} and A:liberica [08:35:41] (03Merged) 10jenkins-bot: Update Recommendation API to 2026-05-26-074931-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293648 (owner: 10KartikMistry) [08:39:02] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5006.eqsin.wmnet} and A:liberica [08:39:10] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [08:39:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet [08:39:57] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[08:40:06] !log start rebooting eqsin liberica instances (T426563) [08:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:11] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5004.eqsin.wmnet} and A:liberica [08:40:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1222.eqiad.wmnet with OS trixie [08:42:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P92939 and previous config saved to /var/cache/conftool/dbconfig/20260526-084240-fceratto.json [08:43:18] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1292032|Grant globalblock-local-status to groups with globalblock-whitelist (T277942)]], [[gerrit:1290964|hCaptcha CommonSettings.php: Don't define sitekeys as config vars]] (duration: 09m 56s) [08:43:22] T277942: Address Voice and Tone issues in GlobalBlocking - https://phabricator.wikimedia.org/T277942 [08:43:24] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [08:43:32] Msz2001: Over to you [08:43:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet [08:43:42] thanks, deploying [08:43:49] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5004.eqsin.wmnet} and A:liberica [08:44:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/OATHAuth] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293594 (https://phabricator.wikimedia.org/T426872) (owner: 10Mszwarc) [08:44:27] (03PS1) 10Jelto: gitlab: block old chrome from accessing api endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1293655 (https://phabricator.wikimedia.org/T427199) [08:45:22] (03PS3) 10Blake: Add wikikube-worker refreshes. 
[puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) [08:45:40] (03Merged) 10jenkins-bot: Allow to remove passkeys when there's only one standard 2FA method [extensions/OATHAuth] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293594 (https://phabricator.wikimedia.org/T426872) (owner: 10Mszwarc) [08:46:05] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1293594|Allow to remove passkeys when there's only one standard 2FA method (T426872)]] [08:46:34] (03PS4) 10Blake: Add wikikube-worker refreshes. [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) [08:47:58] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1293594|Allow to remove passkeys when there's only one standard 2FA method (T426872)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:48:52] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [08:49:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1222: Migration of db1222.eqiad.wmnet completed [08:49:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [08:50:00] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [08:50:06] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5005.eqsin.wmnet} and A:liberica [08:50:21] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1226: Upgrading db1226.eqiad.wmnet [08:51:17] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11954062 (10MoritzMuehlenhoff) [08:51:25] !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox [08:51:29] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. 
on all recursors [08:51:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [08:52:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P92941 and previous config saved to /var/cache/conftool/dbconfig/20260526-085248-fceratto.json [08:53:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1293592 (owner: 10Slyngshede) [08:53:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1226: Upgrading db1226.eqiad.wmnet [08:53:25] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5005.eqsin.wmnet} and A:liberica [08:53:29] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293594|Allow to remove passkeys when there's only one standard 2FA method (T426872)]] (duration: 07m 23s) [08:53:39] !log start rebooting ulsfo liberica instances (T426563) [08:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293609 (owner: 10Muehlenhoff) [08:53:55] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs4008.ulsfo.wmnet} and A:liberica [08:54:39] (03PS1) 10Mszwarc: Fix TypeError in Mandatory2FAChecker [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293658 (https://phabricator.wikimedia.org/T427251) [08:55:42] Deployment done. I'll have one more patch to deploy in a few minutes, but for now I'm freeing deployments, if someone wants to :) [08:55:51] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1226.eqiad.wmnet with OS trixie [08:56:03] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. 
on all recursors [08:56:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [08:56:48] (03CR) 10Ayounsi: [C:03+1] Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [08:57:07] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs4008.ulsfo.wmnet} and A:liberica [08:59:03] (03PS2) 10Blake: site.pp: re-add mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1293656 [08:59:35] FIRING: DiskSpace: Disk space krb1002:9100:/ 2.009% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:01:21] (03CR) 10Slyngshede: [C:03+2] data.yaml Off-boarding for Eli Asikin-Garmager [puppet] - 10https://gerrit.wikimedia.org/r/1293592 (owner: 10Slyngshede) [09:02:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T426633)', diff saved to https://phabricator.wikimedia.org/P92942 and previous config saved to /var/cache/conftool/dbconfig/20260526-090256-fceratto.json [09:02:57] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs4009.ulsfo.wmnet} and A:liberica [09:03:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2224.codfw.wmnet with reason: Maintenance [09:03:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2224 (T426633)', diff saved to https://phabricator.wikimedia.org/P92943 and previous config saved to /var/cache/conftool/dbconfig/20260526-090315-fceratto.json [09:03:49] I'm going to deploy the second patch now [09:04:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" 
[extensions/OATHAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293658 (https://phabricator.wikimedia.org/T427251) (owner: 10Mszwarc) [09:04:35] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 0.494% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:06:11] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs4009.ulsfo.wmnet} and A:liberica [09:06:48] (03Merged) 10jenkins-bot: Fix TypeError in Mandatory2FAChecker [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293658 (https://phabricator.wikimedia.org/T427251) (owner: 10Mszwarc) [09:07:01] (03PS1) 10Kosta Harlan: hCaptcha: Ship a self-contained Grade C captcha bundle [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293661 (https://phabricator.wikimedia.org/T422222) [09:07:15] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1293658|Fix TypeError in Mandatory2FAChecker (T427251)]] [09:07:17] (03CR) 10Ayounsi: [C:03+1] profile::rpkivalidator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1293609 (owner: 10Muehlenhoff) [09:07:19] T427251: TypeError: MediaWiki\User\UserIdentityValue::newRegistered(): Argument #1 ($userId) must be of type int, string given, called in /srv/mediawiki/php-1.47.0-wmf.4/extensions/OATHAuth/src/Enforce2FA/Mandatory2FAChecker.php on line - https://phabricator.wikimedia.org/T427251 [09:07:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2195.codfw.wmnet with OS trixie [09:08:20] Msz2001: I’d like to deploy a patch after you’re done [09:08:30] I'll ping you when done [09:08:31] jmm@cumin2002 restart-reboot (PID 630459) is awaiting input [09:09:07] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1293658|Fix 
TypeError in Mandatory2FAChecker (T427251)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:09:52] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [09:10:09] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11954172 (10MoritzMuehlenhoff) [09:10:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T426633)', diff saved to https://phabricator.wikimedia.org/P92944 and previous config saved to /var/cache/conftool/dbconfig/20260526-091016-fceratto.json [09:10:53] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1226.eqiad.wmnet with reason: host reimage [09:13:49] (03PS1) 10STran: Enable IRS Direct Reporting on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293662 (https://phabricator.wikimedia.org/T425025) [09:14:02] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293658|Fix TypeError in Mandatory2FAChecker (T427251)]] (duration: 06m 47s) [09:14:07] T427251: TypeError: MediaWiki\User\UserIdentityValue::newRegistered(): Argument #1 ($userId) must be of type int, string given, called in /srv/mediawiki/php-1.47.0-wmf.4/extensions/OATHAuth/src/Enforce2FA/Mandatory2FAChecker.php on line - https://phabricator.wikimedia.org/T427251 [09:14:08] (03PS3) 10Cathal Mooney: Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) [09:14:10] kostajh: over to you [09:14:23] Msz2001: thanks! 
[09:14:31] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{kubestage200*} and (A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw) [09:14:45] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{kubestage100*} and (A:wikikube-staging-master-eqiad or A:wikikube-staging-worker-eqiad) [09:14:45] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [09:14:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1226.eqiad.wmnet with reason: host reimage [09:14:58] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11954191 (10ABran-WMF) Following up on yesterday's merge, I created a [[ https://grafana.wikimedia.org/goto/efn7pi5lmtj40e?org... [09:14:58] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1003.eqiad.wmnet [09:15:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293661 (https://phabricator.wikimedia.org/T422222) (owner: 10Kosta Harlan) [09:15:19] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [09:15:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293662 (https://phabricator.wikimedia.org/T425025) (owner: 10STran) [09:15:54] (03CR) 10CWilliams: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [09:16:14] (03CR) 10CI 
reject: [V:04-1] Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [09:16:42] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2195: Migration of db2195.codfw.wmnet completed [09:17:20] (03PS4) 10Cathal Mooney: Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) [09:18:59] (03CR) 10JMeybohm: [C:04-1] "Missing the change to site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) (owner: 10Blake) [09:19:05] (03CR) 10Cathal Mooney: Interface validators: allow for channelized port numbers on Juniper (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [09:19:54] (03CR) 10Ayounsi: [C:03+1] Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [09:20:03] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage1003.eqiad.wmnet [09:20:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P92946 and previous config saved to /var/cache/conftool/dbconfig/20260526-092024-fceratto.json [09:20:58] !log start rebooting esams liberica instances (T426563) [09:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:24] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3010.esams.wmnet} and A:liberica [09:21:29] (03CR) 10JMeybohm: [C:03+1] kafka-main2006: apply host-level override in advance of trixie upgrade 
[0] [puppet] - 10https://gerrit.wikimedia.org/r/1288917 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [09:21:41] (03CR) 10JMeybohm: [C:03+1] kafka-main2007: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288918 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [09:21:47] (03CR) 10JMeybohm: [C:03+1] kafka-main2008: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288919 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [09:21:53] (03CR) 10JMeybohm: [C:03+1] kafka-main2009: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288920 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [09:22:06] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [09:22:08] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [09:22:19] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2002.codfw.wmnet [09:22:51] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2002.codfw.wmnet [09:24:03] (03CR) 10JMeybohm: [C:04-1] "In this change you could remove all the host level overrides and add the hiera keys to `hieradata/role/codfw/kafka/main.yaml` instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [09:24:21] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8579/co" [puppet] - 10https://gerrit.wikimedia.org/r/1293655 (https://phabricator.wikimedia.org/T427199) (owner: 10Jelto) [09:25:08] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3010.esams.wmnet} and A:liberica [09:25:16] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3008.esams.wmnet} and A:liberica [09:25:39] (03Merged) 10jenkins-bot: hCaptcha: Ship a self-contained Grade C captcha bundle [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293661 (https://phabricator.wikimedia.org/T422222) (owner: 10Kosta Harlan) [09:25:50] (03PS1) 10Atsuko: idp: adding stream-internal.w.o to allowed services [puppet] - 10https://gerrit.wikimedia.org/r/1293663 (https://phabricator.wikimedia.org/T348763) [09:26:06] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1293661|hCaptcha: Ship a self-contained Grade C captcha bundle (T422222)]] [09:26:08] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10Toolforge, 13Patch-For-Review: Adjust WMCS Gitlab CI/CD repo to stop using mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423596#11954234 (10fgiunchedi) [09:26:10] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1003.eqiad.wmnet [09:26:11] T422222: Unable to submit edit in Basic mode - https://phabricator.wikimedia.org/T422222 [09:26:11] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1003.eqiad.wmnet [09:26:22] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1004.eqiad.wmnet [09:26:29] (03CR) 10Elukey: [C:03+1] Failover 
irc.w.o to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1293586 (owner: 10Muehlenhoff) [09:27:29] (03CR) 10Brouberol: [C:03+1] idp: adding stream-internal.w.o to allowed services [puppet] - 10https://gerrit.wikimedia.org/r/1293663 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [09:27:32] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage1004.eqiad.wmnet [09:27:58] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1293661|hCaptcha: Ship a self-contained Grade C captcha bundle (T422222)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:28:03] (03PS1) 10JavierMonton: image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) [09:28:48] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3008.esams.wmnet} and A:liberica [09:28:49] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:29:00] (03PS1) 10Kosta Harlan: hCaptcha: Avoid `for (const ... of ...)` in Grade C bundle [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293665 (https://phabricator.wikimedia.org/T422222) [09:29:16] (03CR) 10Blake: "I think patchset 2 was correct, and I made some mistakes trying to work on something else in what I didn't realize was the same branch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) (owner: 10Blake) [09:29:17] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2002.codfw.wmnet [09:29:19] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2002.codfw.wmnet [09:29:31] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2003.codfw.wmnet [09:30:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P92947 and previous config saved to /var/cache/conftool/dbconfig/20260526-093031-fceratto.json [09:30:36] (03CR) 10Arnaudb: [C:03+1] "looks good to me, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1293655 (https://phabricator.wikimedia.org/T427199) (owner: 10Jelto) [09:30:56] !log elukey@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: aux-master-codfw@codfw [09:31:23] (03CR) 10Elukey: [C:03+2] services: move the aux k8s' kubemaster to IPIP load balancing [puppet] - 10https://gerrit.wikimedia.org/r/1289273 (https://phabricator.wikimedia.org/T420439) (owner: 10Elukey) [09:31:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1226.eqiad.wmnet with OS trixie [09:32:46] !log depooling cp2043 to install haproxy-awslc (T419825) [09:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:50] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [09:32:58] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293661|hCaptcha: Ship a self-contained Grade C captcha bundle (T422222)]] (duration: 06m 52s) [09:32:59] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp2043.* [09:33:02] T422222: Unable to submit edit in Basic mode - https://phabricator.wikimedia.org/T422222 [09:33:27] 
(03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293665 (https://phabricator.wikimedia.org/T422222) (owner: 10Kosta Harlan) [09:33:45] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1004.eqiad.wmnet [09:33:46] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1004.eqiad.wmnet [09:33:57] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1005.eqiad.wmnet [09:34:03] (03CR) 10Fabfur: [C:03+2] hiera: using haproxy-awslc on cp2043-cp2044 [puppet] - 10https://gerrit.wikimedia.org/r/1289997 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [09:34:33] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp2044.* [09:34:34] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2003.codfw.wmnet [09:34:36] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage1005.eqiad.wmnet [09:34:39] (03PS5) 10Blake: Add wikikube-worker refreshes. [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) [09:34:44] !log depooling cp2044 to install haproxy-awslc (T419825) [09:34:46] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293663 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [09:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:00] (03PS13) 10Clément Goubert: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [09:35:50] (03PS6) 10Blake: Add wikikube-worker refreshes. 
[puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) [09:36:29] (03CR) 10Blake: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) (owner: 10Blake) [09:37:04] (03CR) 10JMeybohm: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) (owner: 10Blake) [09:37:55] (03CR) 10Clément Goubert: [C:03+2] gateway-check: inference post-migration cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1290019 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [09:37:56] (03CR) 10Blake: [C:03+2] Add wikikube-worker refreshes. [puppet] - 10https://gerrit.wikimedia.org/r/1290719 (https://phabricator.wikimedia.org/T418927) (owner: 10Blake) [09:38:11] !log elukey@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:38:30] (03CR) 10JMeybohm: Update to kubernetes v1.31.14. 
(031 comment) [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1293087 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [09:38:58] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: block old chrome from accessing api endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1293655 (https://phabricator.wikimedia.org/T427199) (owner: 10Jelto) [09:39:08] !log elukey@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:39:08] !log elukey@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: aux-master-codfw@codfw [09:40:26] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1226: Migration of db1226.eqiad.wmnet completed [09:40:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T426633)', diff saved to https://phabricator.wikimedia.org/P92950 and previous config saved to /var/cache/conftool/dbconfig/20260526-094045-fceratto.json [09:41:01] !log fabfur@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3009.esams.wmnet} and A:liberica [09:41:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2229.codfw.wmnet with reason: Maintenance [09:41:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T426633)', diff saved to https://phabricator.wikimedia.org/P92951 and previous config saved to /var/cache/conftool/dbconfig/20260526-094115-fceratto.json [09:41:21] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1005.eqiad.wmnet [09:41:22] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1005.eqiad.wmnet [09:41:33] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1006.eqiad.wmnet [09:42:20] 10SRE-swift-storage, 06Commons: Commons file 
not found - File:UCB Latin Extended-G.png - https://phabricator.wikimedia.org/T427188#11954317 (10Jeff_G) 05Open→03Resolved a:03Jeff_G per https://commons.wikimedia.org/w/index.php?title=Commons%3ADeletion_requests%2FFile%3AUCB_Latin_Extended-G.png&diff=12... [09:43:32] (03Merged) 10jenkins-bot: hCaptcha: Avoid `for (const ... of ...)` in Grade C bundle [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293665 (https://phabricator.wikimedia.org/T422222) (owner: 10Kosta Harlan) [09:44:01] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1293665|hCaptcha: Avoid `for (const ... of ...)` in Grade C bundle (T422222)]] [09:44:02] (03CR) 10Federico Ceratto: "(In the team meeting we discussed also removing the downtime as part of the repooling in a different CR)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: 10CWilliams) [09:44:06] T422222: Unable to submit edit in Basic mode - https://phabricator.wikimedia.org/T422222 [09:44:38] (03PS1) 10Federico Ceratto: sre.mysql: Auto-lint imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1293666 (https://phabricator.wikimedia.org/T419874) [09:44:38] (03PS1) 10Kosta Harlan: hCaptcha: Avoid URL.searchParams in Grade C bundle [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293668 (https://phabricator.wikimedia.org/T422222) [09:44:55] !log fabfur@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3009.esams.wmnet} and A:liberica [09:45:53] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1293665|hCaptcha: Avoid `for (const ... of ...)` in Grade C bundle (T422222)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:46:39] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage1006.eqiad.wmnet [09:47:59] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:48:01] (03PS1) 10Arnaudb: vrts: alerts for the new antispam pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1293667 (https://phabricator.wikimedia.org/T402260) [09:48:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T426633)', diff saved to https://phabricator.wikimedia.org/P92953 and previous config saved to /var/cache/conftool/dbconfig/20260526-094819-fceratto.json [09:48:41] !log repooling cp2043 and cp2044 (haproxy-awslc) (T419825) [09:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:45] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [09:49:04] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [09:49:23] jayme@cumin1003 reboot-nodes (PID 566230) is awaiting input [09:49:35] (03PS2) 10Arnaudb: vrts: alerts for the new antispam pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1293667 (https://phabricator.wikimedia.org/T402260) [09:50:56] (03CR) 10Atsuko: [C:03+2] idp: adding stream-internal.w.o to allowed services [puppet] - 10https://gerrit.wikimedia.org/r/1293663 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [09:51:35] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp2044.* [09:51:40] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp2043.* [09:51:43] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Configure qwen3-14b in rest-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [09:52:08] !log 
kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293665|hCaptcha: Avoid `for (const ... of ...)` in Grade C bundle (T422222)]] (duration: 08m 07s) [09:52:12] T422222: Unable to submit edit in Basic mode - https://phabricator.wikimedia.org/T422222 [09:52:18] !log elukey@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: aux-master-eqiad@eqiad [09:52:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293668 (https://phabricator.wikimedia.org/T422222) (owner: 10Kosta Harlan) [09:52:33] FIRING: KubernetesCalicoDown: wikikube-worker2364.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2364.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:53:36] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1006.eqiad.wmnet [09:53:37] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1006.eqiad.wmnet [09:53:37] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{kubestage100*} and (A:wikikube-staging-master-eqiad or A:wikikube-staging-worker-eqiad) [09:53:56] (03CR) 10Mszwarc: [C:03+1] Enable IRS Direct Reporting on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293662 (https://phabricator.wikimedia.org/T425025) (owner: 10STran) [09:54:08] (03Merged) 10jenkins-bot: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [09:54:20] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2003.codfw.wmnet [09:54:22] !log jayme@cumin1003 
END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2003.codfw.wmnet [09:54:33] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2004.codfw.wmnet [09:55:06] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2004.codfw.wmnet [09:55:35] !log elukey@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:55:45] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:55:59] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:56:35] FIRING: DiskSpace: Disk space krb1002:9100:/ 2.056% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:56:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:56:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: aux-master-eqiad@eqiad [09:57:32] FIRING: [5x] KubernetesCalicoDown: wikikube-worker2358.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:57:42] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:58:05] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:58:07] bjensen: the KubernetesCalicoDown are the hosts you're working on? 
[09:58:12] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:58:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P92955 and previous config saved to /var/cache/conftool/dbconfig/20260526-095827-fceratto.json [09:58:32] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:58:33] claime: ah, sorry, will take a look [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1000) [10:00:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox [10:01:40] ^ I’m nearly done with the wmf.3 backports [10:01:50] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2004.codfw.wmnet [10:01:52] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2004.codfw.wmnet [10:01:52] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{kubestage200*} and (A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw) [10:02:12] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2195: Migration of db2195.codfw.wmnet completed [10:02:13] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:02:32] FIRING: [7x] KubernetesCalicoDown: wikikube-worker2358.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:03:31] (03Merged) 10jenkins-bot: hCaptcha: Avoid URL.searchParams in Grade C bundle [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293668 (https://phabricator.wikimedia.org/T422222) (owner: 10Kosta Harlan) [10:03:57] !log kharlan@deploy1003 Started 
scap sync-world: Backport for [[gerrit:1293668|hCaptcha: Avoid URL.searchParams in Grade C bundle (T422222)]] [10:04:01] T422222: Unable to submit edit in Basic mode - https://phabricator.wikimedia.org/T422222 [10:05:50] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1293668|hCaptcha: Avoid URL.searchParams in Grade C bundle (T422222)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:06:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:35] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [10:06:35] !log kharlan@deploy1003 kharlan: Continuing with deployment [10:07:32] FIRING: [10x] KubernetesCalicoDown: wikikube-worker2357.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:08:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P92957 and previous config saved to /var/cache/conftool/dbconfig/20260526-100834-fceratto.json [10:09:32] !log elukey@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: aux-master-codfw@codfw [10:09:37] (03CR) 10Elukey: [C:03+2] service: move Aux k8s' ingress to IPIP load balancing [puppet] - 10https://gerrit.wikimedia.org/r/1289274 (https://phabricator.wikimedia.org/T420439) (owner: 10Elukey) [10:10:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:40] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293668|hCaptcha: Avoid URL.searchParams in Grade C bundle (T422222)]] (duration: 06m 42s) [10:10:44] T422222: Unable to submit edit in Basic mode - https://phabricator.wikimedia.org/T422222 [10:11:28] Done [10:12:33] FIRING: [13x] KubernetesCalicoDown: wikikube-worker2357.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:15:40] !log elukey@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:16:36] !log elukey@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:16:36] !log elukey@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: aux-master-codfw@codfw [10:17:33] FIRING: [18x] KubernetesCalicoDown: wikikube-worker2357.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:18:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T426633)', diff saved to https://phabricator.wikimedia.org/P92959 and previous config saved to /var/cache/conftool/dbconfig/20260526-101842-fceratto.json [10:19:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [10:19:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T426633)', diff saved to 
https://phabricator.wikimedia.org/P92960 and previous config saved to /var/cache/conftool/dbconfig/20260526-101936-fceratto.json [10:21:59] (03CR) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [10:22:33] FIRING: [18x] KubernetesCalicoDown: wikikube-worker2357.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:24:11] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:24:34] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2181: Upgrading db2181.codfw.wmnet [10:25:04] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2181: Upgrading db2181.codfw.wmnet [10:25:34] (03PS1) 10JMeybohm: kube-state-metrics: Update to v2.14.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293673 (https://phabricator.wikimedia.org/T388387) [10:25:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1226: Migration of db1226.eqiad.wmnet completed [10:25:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:27:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T426633)', diff saved to https://phabricator.wikimedia.org/P92963 and previous config saved to /var/cache/conftool/dbconfig/20260526-102703-fceratto.json [10:27:33] RESOLVED: [18x] KubernetesCalicoDown: wikikube-worker2357.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:28:04] cwilliams@cumin1003 major-upgrade (PID 620962) is awaiting input [10:29:44] (03PS2) 10Elukey: redifish: improve add_account with AccountTypes [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [10:31:38] (03PS1) 10JMeybohm: kube-state-metrics: Update to default to v2.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293675 (https://phabricator.wikimedia.org/T388387) [10:32:23] (03PS3) 10Elukey: redifish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [10:32:44] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/eventstreams-internal: apply [10:32:46] (03PS2) 10JMeybohm: kube-state-metrics: Update to default to v2.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293675 (https://phabricator.wikimedia.org/T388387) [10:32:52] (03CR) 10Effie Mouzeli: "nit: please at task #" [puppet] - 10https://gerrit.wikimedia.org/r/1293656 (owner: 10Blake) [10:32:57] (03CR) 10Effie Mouzeli: [C:03+1] site.pp: re-add mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1293656 (owner: 10Blake) [10:33:09] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/eventstreams-internal: apply [10:36:24] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2181.codfw.wmnet with OS trixie [10:36:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:37:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P92964 and previous config saved to /var/cache/conftool/dbconfig/20260526-103711-fceratto.json [10:37:17] (03PS1) 10Marostegui: db1222: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293677 (https://phabricator.wikimedia.org/T424615) [10:38:20] jouncebot: now [10:38:20] 
For the next 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1000) [10:38:41] (03CR) 10Marostegui: [C:03+2] db1222: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293677 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [10:39:29] (03PS2) 10Effie Mouzeli: ProductionServices.php: switch filebackend.php to rdb2011:6382 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293095 (https://phabricator.wikimedia.org/T418261) [10:41:49] (03PS3) 10Blake: site.pp: re-add mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1293656 (https://phabricator.wikimedia.org/T426044) [10:42:03] (03CR) 10Clément Goubert: [C:03+1] "Different instance on same node is intended." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293095 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [10:42:21] (03CR) 10Blake: "Done, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1293656 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [10:42:53] (03CR) 10Effie Mouzeli: "Aye, known issue, has been like that for a few years now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293095 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [10:43:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jiji@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293095 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [10:43:47] (03CR) 10Blake: [C:03+1] kube-state-metrics: Update to v2.14.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293673 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [10:43:57] (03CR) 10Blake: [C:03+1] kube-state-metrics: Update to default to v2.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293675 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [10:44:24] (03Merged) 
10jenkins-bot: ProductionServices.php: switch filebackend.php to rdb2011:6382 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293095 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [10:44:49] !log jiji@deploy1003 Started scap sync-world: Backport for [[gerrit:1293095|ProductionServices.php: switch filebackend.php to rdb2011:6382 (T418261 T419976)]] [10:44:55] T418261: rdb20[11-12] implementation tracking - https://phabricator.wikimedia.org/T418261 [10:44:56] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [10:45:45] (03PS1) 10Marostegui: pc1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293680 (https://phabricator.wikimedia.org/T418973) [10:45:55] (03PS1) 10Jelto: Revert "gitlab: block old chrome from accessing api endpoints" [puppet] - 10https://gerrit.wikimedia.org/r/1293681 (https://phabricator.wikimedia.org/T427199) [10:46:33] (03CR) 10Marostegui: [C:03+2] pc1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293680 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [10:46:42] !log jiji@deploy1003 jiji: Backport for [[gerrit:1293095|ProductionServices.php: switch filebackend.php to rdb2011:6382 (T418261 T419976)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:47:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P92966 and previous config saved to /var/cache/conftool/dbconfig/20260526-104718-fceratto.json [10:51:33] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2181.codfw.wmnet with reason: host reimage [10:52:37] (03CR) 10Jelto: [C:03+2] Revert "gitlab: block old chrome from accessing api endpoints" [puppet] - 10https://gerrit.wikimedia.org/r/1293681 (https://phabricator.wikimedia.org/T427199) (owner: 10Jelto) [10:55:34] FIRING: DiskSpace: Disk space krb1002:9100:/ 2.34% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [10:55:40] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: host reimage [10:56:29] !log jiji@deploy1003 jiji: Continuing with deployment [10:56:52] (03CR) 10CWilliams: cookbooks/sre/mysql/decommission: add cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [10:57:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T426633)', diff saved to https://phabricator.wikimedia.org/P92967 and previous config saved to /var/cache/conftool/dbconfig/20260526-105726-fceratto.json [10:57:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [10:57:54] (03PS4) 10Elukey: redifish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [10:57:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T426633)', diff saved to 
https://phabricator.wikimedia.org/P92968 and previous config saved to /var/cache/conftool/dbconfig/20260526-105755-fceratto.json [10:59:06] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:59:27] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1214: Upgrading db1214.eqiad.wmnet [11:00:16] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1214: Upgrading db1214.eqiad.wmnet [11:00:40] !log jiji@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293095|ProductionServices.php: switch filebackend.php to rdb2011:6382 (T418261 T419976)]] (duration: 15m 50s) [11:00:46] T418261: rdb20[11-12] implementation tracking - https://phabricator.wikimedia.org/T418261 [11:00:46] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [11:02:02] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1214.eqiad.wmnet with OS trixie [11:03:02] (03PS5) 10Elukey: redifish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [11:03:07] (03CR) 10Marostegui: [C:04-1] sre.mysql.global-read-only Set all sections as RO/RW (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:04:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:04:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T426633)', diff saved to https://phabricator.wikimedia.org/P92971 and previous config saved to /var/cache/conftool/dbconfig/20260526-110458-fceratto.json [11:05:35] RESOLVED: 
DiskSpace: Disk space krb1002:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:06:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:07:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266 (10MatthewVernon) 03NEW [11:07:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266#11954643 (10MatthewVernon) p:05Triage→03High [11:08:21] (03PS6) 10Elukey: redifish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [11:09:17] jouncebot: nowandnext [11:09:17] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [11:09:17] In 0 hour(s) and 50 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1200) [11:09:29] (03CR) 10CWilliams: [C:03+1] sre.mysql: Auto-lint imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1293666 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:09:38] (03PS1) 10Lucas Werkmeister (WMDE): Fix path to wikibase.wikiprojects.tracking.js [extensions/Wikibase] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293691 (https://phabricator.wikimedia.org/T421856) [11:09:45] I’ll deploy ^ to fix T427252 (train blocker) [11:09:45] T427252: Wikibase: Missing wikibase.wikiprojects.tracking.js file causing exceptions on every request - https://phabricator.wikimedia.org/T427252 [11:09:56] RESOLVED: [2x] ProbeDown: 
Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293691 (https://phabricator.wikimedia.org/T421856) (owner: 10Lucas Werkmeister (WMDE)) [11:13:22] (03CR) 10CWilliams: "@fceratto@wikimedia.org this sounds reasonable to me, was there an outcome to this?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [11:14:16] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2181.codfw.wmnet with OS trixie [11:14:26] (03CR) 10Elukey: "Jesse lemme know what you think about this, I tried to keep it as generic as possible. Tested in in my script on sretest2010, all good. 
Th" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [11:14:58] (03PS1) 10Marostegui: instances.yaml: Add pc1024 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1293692 (https://phabricator.wikimedia.org/T418973) [11:15:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P92972 and previous config saved to /var/cache/conftool/dbconfig/20260526-111506-fceratto.json [11:17:02] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage [11:17:30] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add pc1024 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1293692 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [11:19:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:20:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Switchover es2042 es2041 for T426199', diff saved to https://phabricator.wikimedia.org/P92974 and previous config saved to /var/cache/conftool/dbconfig/20260526-112028-fceratto.json [11:20:34] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [11:21:26] (03PS11) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [11:21:26] (03PS2) 10Kamila Součková: Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) [11:22:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc1024 to dbctl T418973', diff saved to 
https://phabricator.wikimedia.org/P92975 and previous config saved to /var/cache/conftool/dbconfig/20260526-112215-marostegui.json [11:22:20] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [11:22:41] (03PS1) 10Jelto: gitlab: block old chrome [puppet] - 10https://gerrit.wikimedia.org/r/1293696 (https://phabricator.wikimedia.org/T427199) [11:22:52] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2181: Migration of db2181.codfw.wmnet completed [11:23:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc4 T418973', diff saved to https://phabricator.wikimedia.org/P92977 and previous config saved to /var/cache/conftool/dbconfig/20260526-112326-marostegui.json [11:23:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:04] (03CR) 10Arnaudb: [C:03+1] "good luck with the retry, lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1293696 (https://phabricator.wikimedia.org/T427199) (owner: 10Jelto) [11:24:05] (03PS1) 10Marostegui: pc1024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293697 (https://phabricator.wikimedia.org/T418973) [11:24:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage [11:24:14] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:24:50] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8580/co" [puppet] - 10https://gerrit.wikimedia.org/r/1293696 (https://phabricator.wikimedia.org/T427199) (owner: 10Jelto) [11:25:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P92978 and previous config saved to /var/cache/conftool/dbconfig/20260526-112513-fceratto.json [11:25:32] (03PS1) 10Clément Goubert: trafficserver: Default most APIs to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1293699 (https://phabricator.wikimedia.org/T422937) [11:26:47] (03CR) 10Marostegui: [C:03+2] pc1024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1293697 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [11:27:31] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Complete rollout to all wikis (group2 + cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293167 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan) [11:27:41] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Exempt CommunityRequests pages from edit/create triggers [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) (owner: 10Kosta Harlan) [11:27:52] jouncebot: nowandnext [11:27:52] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [11:27:52] In 0 hour(s) and 32 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1200) [11:27:56] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [11:27:57] Going to use scap [11:28:01] Dreamy_Jazz: I’m currently deploying a backport [11:28:07] (waiting for gate-and-submit but it should be almost done) [11:28:09] Oh, thanks for the heads up [11:28:20] (https://spiderpig.wikimedia.org/jobs/2083) [11:28:32] (03Merged) 10jenkins-bot: Fix path to wikibase.wikiprojects.tracking.js [extensions/Wikibase] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1293691 (https://phabricator.wikimedia.org/T421856) (owner: 10Lucas Werkmeister (WMDE)) [11:28:34] Mind pinging me when done? [11:28:37] sure [11:28:39] Thanks [11:28:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:59] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1293691|Fix path to wikibase.wikiprojects.tracking.js (T421856 T427252)]] [11:29:04] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [11:29:05] T421856: [WIPR] Prototype - Track clicks of Wikiproject link on item page - https://phabricator.wikimedia.org/T421856 [11:29:05] T427252: Wikibase: Missing wikibase.wikiprojects.tracking.js file causing exceptions on every request - https://phabricator.wikimedia.org/T427252 [11:30:08] (03CR) 10Effie Mouzeli: [C:03+2] role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [11:30:18] Dreamy_Jazz: config change or backport? [11:30:32] ok, based on wikibugs backscroll I’m guessing config change [11:30:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1222: Migration of db1222.eqiad.wmnet completed [11:30:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:30:39] (if it was a backport you could already +2 it) [11:30:40] Yes [11:30:42] ok [11:30:57] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1293691|Fix path to wikibase.wikiprojects.tracking.js (T421856 T427252)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:31:01] * Lucas_WMDE looks [11:31:16] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11954811 (10MoritzMuehlenhoff) [11:31:17] (03CR) 10CWilliams: sre.mysql.global-read-only Set all sections as RO/RW (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:31:28] I don’t see any more RuntimeException, yay [11:31:35] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with deployment [11:33:34] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11954818 (10MoritzMuehlenhoff) [11:33:50] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11954819 (10MoritzMuehlenhoff) [11:33:58] (03CR) 10CWilliams: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:34:34] (03PS1) 10Marco Fossati: MultimediaViewer: enable image carousel as a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) [11:35:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T426633)', diff saved to https://phabricator.wikimedia.org/P92980 and previous config saved to /var/cache/conftool/dbconfig/20260526-113521-fceratto.json [11:35:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance [11:35:35] Dreamy_Jazz: over to you (currently waiting 20 seconds for production traffic) [11:35:41] Thanks [11:35:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2207 (T426633)', diff saved to https://phabricator.wikimedia.org/P92981 and previous config saved to 
/var/cache/conftool/dbconfig/20260526-113542-fceratto.json [11:35:45] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293691|Fix path to wikibase.wikiprojects.tracking.js (T421856 T427252)]] (duration: 06m 46s) [11:35:52] T421856: [WIPR] Prototype - Track clicks of Wikiproject link on item page - https://phabricator.wikimedia.org/T421856 [11:35:53] T427252: Wikibase: Missing wikibase.wikiprojects.tracking.js file causing exceptions on every request - https://phabricator.wikimedia.org/T427252 [11:36:15] * Lucas_WMDE watches logspam-watch for a bit to see if the errors drop off [11:36:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:43] (03PS1) 10Marostegui: mariadb: Decommission pc1013 [puppet] - 10https://gerrit.wikimedia.org/r/1293702 (https://phabricator.wikimedia.org/T427190) [11:39:21] (03PS1) 10Clément Goubert: trafficserver: Route /media/math directly to restbase [puppet] - 10https://gerrit.wikimedia.org/r/1293703 (https://phabricator.wikimedia.org/T422937) [11:39:24] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11954848 (10Marostegui) [11:40:33] (03CR) 10Federico Ceratto: "For cookbooks and internal team scripts I prefer to capture the output for different reasons:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [11:41:07] (03CR) 10CI reject: [V:04-1] Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [11:41:19] (03CR) 10Marco Fossati: "Hey 
@jforrester@wikimedia.org, Reader Growth would like to launch the image carousel beta feature. Would you mind having a look & +1? Than" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [11:41:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1214.eqiad.wmnet with OS trixie [11:42:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293167 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan) [11:42:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) (owner: 10Kosta Harlan) [11:42:13] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet [11:42:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T426633)', diff saved to https://phabricator.wikimedia.org/P92983 and previous config saved to /var/cache/conftool/dbconfig/20260526-114243-fceratto.json [11:42:56] FIRING: ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:37] (03Merged) 10jenkins-bot: hCaptcha: Complete rollout to all wikis (group2 + cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293167 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan) [11:43:41] (03Merged) 10jenkins-bot: hCaptcha: Exempt CommunityRequests pages from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) (owner: 10Kosta Harlan) 
[11:44:05] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1293167|hCaptcha: Complete rollout to all wikis (group2 + cleanup) (T425354)]], [[gerrit:1290055|hCaptcha: Exempt CommunityRequests pages from edit/create triggers (T426897)]] [11:44:10] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [11:44:11] T426897: hCaptcha: Add support in CommunityRequests extension - https://phabricator.wikimedia.org/T426897 [11:45:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1004.eqiad.wmnet [11:45:59] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Backport for [[gerrit:1293167|hCaptcha: Complete rollout to all wikis (group2 + cleanup) (T425354)]], [[gerrit:1290055|hCaptcha: Exempt CommunityRequests pages from edit/create triggers (T426897)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:47:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [11:47:56] RESOLVED: ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:16] (03PS1) 10Clément Goubert: trafficserver: Remove all gateway-check config [puppet] - 10https://gerrit.wikimedia.org/r/1293704 (https://phabricator.wikimedia.org/T422937) [11:49:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1004.eqiad.wmnet [11:49:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1214: Migration of db1214.eqiad.wmnet completed [11:51:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1005.eqiad.wmnet [11:52:21] (03CR) 10Muehlenhoff: [C:03+2] Failover 
irc.w.o to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1293586 (owner: 10Muehlenhoff) [11:52:41] !log jmm@dns1004 START - running authdns-update [11:52:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P92985 and previous config saved to /var/cache/conftool/dbconfig/20260526-115251-fceratto.json [11:53:09] (03PS10) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [11:53:13] Still testing [11:54:20] !log jmm@dns1004 END - running authdns-update [11:54:51] !log stopping mediabackups@codfw for maintenance on a codfw backup media storage server T426199 [11:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:55] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [11:55:19] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Continuing with deployment [11:55:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1005.eqiad.wmnet [11:55:47] jynus: does db2197 need some special care too? 
[11:56:03] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:56:38] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on backup2015.codfw.wmnet,db2197.codfw.wmnet with reason: network maintenance [11:57:35] FIRING: DiskSpace: Disk space krb1002:9100:/ 1.815% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:57:42] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: block old chrome [puppet] - 10https://gerrit.wikimedia.org/r/1293696 (https://phabricator.wikimedia.org/T427199) (owner: 10Jelto) [11:58:02] great, thanks! [11:59:31] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293167|hCaptcha: Complete rollout to all wikis (group2 + cleanup) (T425354)]], [[gerrit:1290055|hCaptcha: Exempt CommunityRequests pages from edit/create triggers (T426897)]] (duration: 15m 26s) [11:59:38] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [11:59:38] T426897: hCaptcha: Add support in CommunityRequests extension - https://phabricator.wikimedia.org/T426897 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1200) [12:00:45] (03CR) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [12:01:43] !log start ssw1-a1-codfw network maintenance (no impact expected as the spines are redundant) [12:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:59] !log fceratto@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P92987 and previous config saved to /var/cache/conftool/dbconfig/20260526-120258-fceratto.json [12:05:56] (03PS2) 10Jon Harald Søby: Disable the `no` language code for translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293706 (https://phabricator.wikimedia.org/T424613) [12:06:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293706 (https://phabricator.wikimedia.org/T424613) (owner: 10Jon Harald Søby) [12:06:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:35] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:08:21] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2181: Migration of db2181.codfw.wmnet completed [12:08:22] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:08:43] !log downtime, disable puppet and stop pybal for rack maintenance (T426199) [12:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:47] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [12:09:06] (03CR) 10Majavah: [C:03+1] Replace role::mariadb::ferm with profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [12:09:26] !log 
fabfur@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: Planned downtime for rack maintenance [12:09:48] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293707 [12:10:36] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292254 (owner: 10PipelineBot) [12:13:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T426633)', diff saved to https://phabricator.wikimedia.org/P92990 and previous config saved to /var/cache/conftool/dbconfig/20260526-121306-fceratto.json [12:13:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2225.codfw.wmnet with reason: Maintenance [12:13:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T426633)', diff saved to https://phabricator.wikimedia.org/P92991 and previous config saved to /var/cache/conftool/dbconfig/20260526-121336-fceratto.json [12:16:26] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293707 (owner: 10PipelineBot) [12:17:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mirror1001.wikimedia.org [12:18:36] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293707 (owner: 10PipelineBot) [12:19:52] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:20:13] !log dbrant@deploy1003 helmfile [staging] DONE 
helmfile.d/services/mobileapps: apply [12:20:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T426633)', diff saved to https://phabricator.wikimedia.org/P92993 and previous config saved to /var/cache/conftool/dbconfig/20260526-122044-fceratto.json [12:22:17] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:22:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks great" [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [12:23:00] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:23:12] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:23:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mirror1001.wikimedia.org [12:24:01] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:24:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:24:43] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ssw1-a1-codfw,ssw1-a1-codfw IPv6,ssw1-a1-codfw.mgmt with reason: Switch maintenance [12:26:09] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp2043.* [12:26:36] !log depooled cp204 for network activity (T426199) [12:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:40] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [12:29:14] (03CR) 10Tiziano Fogli: alerts: add transformations option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1291947 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [12:29:14] 
RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:30:29] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: align bastion_hosts puppet type [puppet] - 10https://gerrit.wikimedia.org/r/1291946 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [12:30:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P92995 and previous config saved to /var/cache/conftool/dbconfig/20260526-123052-fceratto.json [12:31:07] (03CR) 10Tiziano Fogli: [C:03+1] toolforge: use alerts::deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1291948 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [12:31:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:33:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org [12:35:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1214: Migration of db1214.eqiad.wmnet completed [12:35:29] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:36:00] (03PS4) 10Ladsgroup: mariadb: Migrate public dbproxies to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) [12:36:05] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Migrate public dbproxies to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [12:36:25] RESOLVED: SystemdUnitFailed: 
prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org [12:38:32] (03PS4) 10Ladsgroup: wikireplicas: Migrate from ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) [12:38:39] (03CR) 10Ladsgroup: [V:03+2 C:03+2] wikireplicas: Migrate from ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [12:40:55] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P92997 and previous config saved to /var/cache/conftool/dbconfig/20260526-124059-fceratto.json [12:44:22] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "The diff is like this:" [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [12:45:18] (03PS1) 10Ladsgroup: Site info should output thumblimits as array [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293710 (https://phabricator.wikimedia.org/T427066) [12:45:31] jouncebot: nowandnext [12:45:31] For the next 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1200) [12:45:31] In 0 hour(s) and 14 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1300) [12:50:03] (03CR) 10Arnaudb: [C:03+2] dns.admin: add gitlab-addrs resource [cookbooks] - 10https://gerrit.wikimedia.org/r/1290676 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:50:23] (03CR) 10Arnaudb: [C:03+2] conftool-data: geodns: add gitlab-addrs [puppet] - 10https://gerrit.wikimedia.org/r/1290677 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:51:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T426633)', diff saved to https://phabricator.wikimedia.org/P92998 and previous config saved to /var/cache/conftool/dbconfig/20260526-125105-fceratto.json [12:51:17] !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:51:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2226.codfw.wmnet with reason: Maintenance [12:51:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T426633)', diff saved to https://phabricator.wikimedia.org/P92999 and previous config saved to /var/cache/conftool/dbconfig/20260526-125135-fceratto.json [12:51:44] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11955077 (10MoritzMuehlenhoff) [12:52:05] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11955078 (10MoritzMuehlenhoff) [12:53:09] (03Merged) 10jenkins-bot: dns.admin: add gitlab-addrs resource [cookbooks] - 10https://gerrit.wikimedia.org/r/1290676 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:57:52] (03CR) 10Jforrester: [C:03+1] "LGTM. Go for it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [12:58:15] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279 (10KMontalva-WMF) 03NEW [12:58:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T426633)', diff saved to https://phabricator.wikimedia.org/P93000 and previous config saved to /var/cache/conftool/dbconfig/20260526-125834-fceratto.json [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1300). [13:00:05] aude, stephanebisson, Tran, and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:11] o/ [13:00:11] o/ [13:00:32] I can deploy if needed [13:00:42] !log deactivate CR BGP to doh2002 to test backup path via doh2001 [13:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:40] stephanebisson: do you want to start with your backport? [13:01:46] (03CR) 10Atsuko: "I'm ready to put cautious +1 if it is needed to unblock further work. 
However, upgrading the final image to trixie/python3.13 in Iba13d265" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [13:01:47] Sure [13:02:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293177 (https://phabricator.wikimedia.org/T422146) (owner: 10Sbisson) [13:02:13] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11955114 (10MatthewVernon) [13:02:35] 10SRE-swift-storage, 06Commons, 06Data-Persistence, 10MediaWiki-File-management, 10Thumbor: Commons file page should use standard thumb sizes - https://phabricator.wikimedia.org/T426970#11955119 (10MatthewVernon) →14Duplicate dup:03T401668 [13:03:14] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:03:43] o/ present [13:03:51] (sorry, didn't notice the ping till now) [13:04:01] hi! I can deploy your config change after stephanebisson [13:04:07] super! 
[13:04:07] !log Update Recommendation API to 2026-05-26-074931-production [13:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:37] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:04:42] (03Merged) 10jenkins-bot: Instrumentation: log new articles namespace and source [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293177 (https://phabricator.wikimedia.org/T422146) (owner: 10Sbisson) [13:04:49] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2167: Upgrading db2167.codfw.wmnet [13:05:07] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1293177|Instrumentation: log new articles namespace and source (T422146)]] [13:05:11] T422146: Experiment config and schema registration (Article Guidance initial intervention) - https://phabricator.wikimedia.org/T422146 [13:05:19] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2167: Upgrading db2167.codfw.wmnet [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:16] 10SRE-swift-storage, 10Ceph, 06Infrastructure-Foundations: Create new S3 backends for the Docker Registry service - https://phabricator.wikimedia.org/T427175#11955134 (10MatthewVernon) [13:06:59] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1293177|Instrumentation: log new articles namespace and source (T422146)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:07:12] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2167.codfw.wmnet with OS trixie [13:07:24] (03PS4) 10Dzahn: tcpproxy: add support for gitlab-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) [13:07:53] (03PS7) 10Arnaudb: lvs7003: add gitlab-ssh and gitlab-https [puppet] - 10https://gerrit.wikimedia.org/r/1291898 (https://phabricator.wikimedia.org/T425441) [13:08:00] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:08:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P93003 and previous config saved to /var/cache/conftool/dbconfig/20260526-130842-fceratto.json [13:09:15] (03PS7) 10Arnaudb: service: add gitlab-https and gitlab-ssh service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) [13:09:17] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:12:16] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293177|Instrumentation: log new articles namespace and source (T422146)]] (duration: 07m 09s) [13:12:20] T422146: Experiment config and schema registration (Article Guidance initial intervention) - https://phabricator.wikimedia.org/T422146 [13:12:39] I'm done, back to you Lucas_WMDE [13:12:42] thanks! 
[13:12:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293706 (https://phabricator.wikimedia.org/T424613) (owner: 10Jon Harald Søby) [13:13:45] (03Merged) 10jenkins-bot: Disable the `no` language code for translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293706 (https://phabricator.wikimedia.org/T424613) (owner: 10Jon Harald Søby) [13:14:03] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11955183 (10elukey) Found another Redfish issue, this task need to wait for https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1293593 to be merged and deployed (new spicerack rele... [13:14:10] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1293706|Disable the `no` language code for translation (T424613)]] [13:14:11] (03CR) 10Ssingh: "We will be discussing this today in the Traffic meeting and I will follow up after that." [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [13:14:14] T424613: Special:Translate redirects links from no to nb - https://phabricator.wikimedia.org/T424613 [13:16:00] (03CR) 10Kamila Součková: [C:03+1] kube-state-metrics: Update to v2.14.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293673 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [13:16:02] !log lucaswerkmeister-wmde@deploy1003 jhsoby, lucaswerkmeister-wmde: Backport for [[gerrit:1293706|Disable the `no` language code for translation (T424613)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:16:18] Jhs: please test :) [13:16:44] (03CR) 10Kamila Součková: [C:03+1] kube-state-metrics: Update to default to v2.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293675 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [13:18:10] PROBLEM - Host cp2043 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:24] RECOVERY - Host cp2043 is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [13:18:28] Lucas_WMDE, works as expected 👍 [13:18:32] !log lucaswerkmeister-wmde@deploy1003 jhsoby, lucaswerkmeister-wmde: Continuing with deployment [13:18:34] thanks! [13:18:37] (03PS2) 10Ladsgroup: wikimedia.org: Add DNS record for conductwiki [dns] - 10https://gerrit.wikimedia.org/r/1292347 (https://phabricator.wikimedia.org/T426984) [13:18:45] (03CR) 10Ladsgroup: [C:03+2] wikimedia.org: Add DNS record for conductwiki [dns] - 10https://gerrit.wikimedia.org/r/1292347 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [13:18:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P93004 and previous config saved to /var/cache/conftool/dbconfig/20260526-131850-fceratto.json [13:20:40] (03CR) 10Ladsgroup: [V:03+2 C:03+2] wikimedia.org: Add DNS record for conductwiki [dns] - 10https://gerrit.wikimedia.org/r/1292347 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [13:20:46] !log ladsgroup@dns1004 START - running authdns-update [13:21:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:21:39] (03CR) 10Kamila Součková: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1293699 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [13:22:26] !log ladsgroup@dns1004 END - running authdns-update [13:22:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All 
pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:22:40] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293706|Disable the `no` language code for translation (T424613)]] (duration: 08m 30s) [13:22:44] T424613: Special:Translate redirects links from no to nb - https://phabricator.wikimedia.org/T424613 [13:22:51] I don’t see aude yet, so Tran: over to you ^^ [13:23:39] 👍 k starting my deploy then [13:24:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293662 (https://phabricator.wikimedia.org/T425025) (owner: 10STran) [13:25:10] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2167.codfw.wmnet with reason: host reimage [13:25:16] (03Merged) 10jenkins-bot: Enable IRS Direct Reporting on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293662 (https://phabricator.wikimedia.org/T425025) (owner: 10STran) [13:25:40] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1293662|Enable IRS Direct Reporting on testwiki (T425025)]] [13:25:44] T425025: Implement email direct reporting for IRS - https://phabricator.wikimedia.org/T425025 [13:27:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:27:35] !log stran@deploy1003 stran: Backport for [[gerrit:1293662|Enable IRS Direct Reporting on testwiki (T425025)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:27:57] testing now [13:28:00] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lsw1-a2-codfw,lsw1-a2-codfw IPv6,lsw1-a2-codfw.mgmt with reason: Switch maintenance [13:28:52] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11955237 (10HShaikh) as kevin's manager. I approve this request. Access is needed to run workflow that pulls data from a hive table. and it is too heavy to run in superset... [13:28:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T426633)', diff saved to https://phabricator.wikimedia.org/P93005 and previous config saved to /var/cache/conftool/dbconfig/20260526-132857-fceratto.json [13:29:08] 06SRE, 10SRE-Access-Requests: Requesting Access to Analytics Data Lake - https://phabricator.wikimedia.org/T427197#11955243 (10HShaikh) as luvos manager. I approve this request. Access is needed to run workflow that pulls data from a hive table. and it is too heavy to run in superset and export. 
[13:29:18] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 34 hosts with reason: Switch maintenance [13:29:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: Maintenance [13:29:23] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2167.codfw.wmnet with reason: host reimage [13:29:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T426633)', diff saved to https://phabricator.wikimedia.org/P93006 and previous config saved to /var/cache/conftool/dbconfig/20260526-132927-fceratto.json [13:29:49] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2003.codfw.wmnet,wikikube-worker[2248-2250].codfw.wmnet [13:30:50] (spike of MultiHttpClient errors at the top of logspam-watch is T369186 I think, nothing new) [13:30:50] looks good, continuing [13:30:50] T369186: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186 [13:30:57] !log stran@deploy1003 stran: Continuing with deployment [13:31:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:31:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2003.codfw.wmnet,wikikube-worker[2248-2250].codfw.wmnet [13:34:05] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2196: switch maintenance [13:34:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2196: switch maintenance [13:34:53] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2221: switch maintenance [13:35:08] !log 
stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293662|Enable IRS Direct Reporting on testwiki (T425025)]] (duration: 09m 28s) [13:35:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2221: switch maintenance [13:35:13] T425025: Implement email direct reporting for IRS - https://phabricator.wikimedia.org/T425025 [13:35:17] done [13:35:21] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2222: switch maintenance [13:35:50] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2222: switch maintenance [13:35:59] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2223: switch maintenance [13:36:08] Lucas_WMDE: just aude left I think, if they're here [13:36:19] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2223: switch maintenance [13:36:40] !log reboot lsw1-a2-codfw for software upgrade - T426199 [13:36:43] Tran: thanks! yeah let’s wait a bit and then close the window [13:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:44] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [13:36:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T426633)', diff saved to https://phabricator.wikimedia.org/P93011 and previous config saved to /var/cache/conftool/dbconfig/20260526-133656-fceratto.json [13:37:50] (03PS7) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [13:37:50] (03PS1) 10Elukey: Fix datetime-related warnings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 [13:40:12] (03CR) 10FNegri: [C:03+1] "I'm fine with keeping the current version of this patch if you prefer it this way. 
:)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [13:40:39] FIRING: CoreBGPDown: Core BGP session down between ssw1-a8-codfw and lsw1-a2-codfw (10.192.252.4) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a8-codfw:9804&var-bgp_group=EVPN_IBGP&var-bgp_neighbor=lsw1-a2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:40:43] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:40:51] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:40:59] (03PS2) 10Elukey: Fix datetime-related warnings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 [13:41:07] (03CR) 10Blake: [C:03+2] site.pp: re-add mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1293656 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [13:41:07] FIRING: [3x] ProbeDown: Service ml-cache2001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-a8-codfw:et-0/0/1 (Core: lsw1-a2-codfw:et-0/0/54 {#230403800026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-a8-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:42:18] (03PS3) 10Elukey: 
Fix datetime-related and pytest warnings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 [13:44:14] FIRING: JobUnavailable: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:46:07] FIRING: [4x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:24] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2167.codfw.wmnet with OS trixie [13:47:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P93012 and previous config saved to /var/cache/conftool/dbconfig/20260526-134703-fceratto.json [13:47:19] (03CR) 10Hnowlan: [C:03+1] trafficserver: Default most APIs to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1293699 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [13:47:33] (03CR) 10Hnowlan: [C:03+1] trafficserver: Route /media/math directly to restbase [puppet] - 10https://gerrit.wikimedia.org/r/1293703 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [13:47:56] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282 (10MoritzMuehlenhoff) 03NEW [13:48:22] !log UTC afternoon backport+config window done [13:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:49:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:49:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.decommission [13:49:46] !ack [13:49:52] VictorOps API error [13:49:57] great [13:49:59] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc1013.eqiad.wmnet [13:50:11] fabfur: I just acked it in the web UI, possibly just a race [13:50:12] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [13:50:36] moritzm: ok, it's cp1100? 
[13:50:45] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:50:51] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:14] (03CR) 10Ladsgroup: [C:03+2] Site info should output thumblimits as array [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293710 (https://phabricator.wikimedia.org/T427066) (owner: 10Ladsgroup) [13:51:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-a8-codfw:et-0/0/1 (Core: lsw1-a2-codfw:et-0/0/54 {#230403800026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-a8-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:52:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:52:36] codfw A2 is back up [13:52:47] monitoring it a bit then will repool services [13:53:08] (03CR) 10Kamila Součková: [C:03+1] trafficserver: Route /media/math directly to restbase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293703 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [13:53:22] !log drop flaggedrevs tables on cawikinews (T423577) [13:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:26] T423577: Undeploy FlaggedRevs from Wikinews and drop FlaggedRevs tables - https://phabricator.wikimedia.org/T423577 [13:53:49] (03CR) 10Kamila Součková: [C:03+1] 
trafficserver: Route /media/math directly to restbase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293703 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [13:54:00] (03CR) 10Hnowlan: [C:03+1] "Huge moment!" [puppet] - 10https://gerrit.wikimedia.org/r/1293704 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [13:54:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:54:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:54:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:54:14] RESOLVED: JobUnavailable: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:54:44] FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32390538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:55:00] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2167: Migration of db2167.codfw.wmnet completed [13:55:10] (03CR) 10Kamila Součková: [C:03+1] trafficserver: Remove all gateway-check config [puppet] - 10https://gerrit.wikimedia.org/r/1293704 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [13:55:12] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [13:55:22] I think sessionstore might have had some issues during that [13:55:30] (03CR) 10Federico Ceratto: sre.mysql.upgrade: fix looping logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [13:55:39] RESOLVED: CoreBGPDown: Core BGP session down between ssw1-a8-codfw and lsw1-a2-codfw (10.192.252.4) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a8-codfw:9804&var-bgp_group=EVPN_IBGP&var-bgp_neighbor=lsw1-a2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDow [13:55:45] don't see any errors on the service side but that session loss spike lines up [13:55:54] (03PS2) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [13:55:55] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:01] 
I see a spike on save edits failures [13:56:01] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1054.eqiad.wmnet with OS trixie [13:56:06] not ongoing anymore [13:56:07] RESOLVED: [4x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:56] actually, still ongoing [13:57:04] fabfur: why cp1100? I don't see anything unusual on it? [13:57:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P93014 and previous config saved to /var/cache/conftool/dbconfig/20260526-135711-fceratto.json [13:57:12] https://grafana.wikimedia.org/goto/dfn8f7v25jojkc?orgId=1 [13:57:15] (03PS4) 10Elukey: Fix datetime-related and pytest warnings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 [13:57:33] moritzm: it alerted for high load earlier [13:57:36] session_loss, so yes, probably to sessionstore [13:57:50] (03PS1) 10Arnaudb: gitlab: add envoy on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1293722 (https://phabricator.wikimedia.org/T425441) [13:58:01] (03PS4) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [13:58:13] but could be unrelated [13:58:20] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission pc1013 [puppet] - 10https://gerrit.wikimedia.org/r/1293702 (https://phabricator.wikimedia.org/T427190) (owner: 10Marostegui) [13:58:49] yeah, there was a slight increase, but all within norms for a cache host https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-1h&to=now&timezone=utc&var-server=cp1100&var-datasource=000000026&var-cluster=cache_text&refresh=5m [13:59:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas 
anchor: failures over threshold for measurement 32390538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:00:04] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1400) [14:00:32] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1289998 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [14:00:41] (03CR) 10Elukey: [C:03+1] pki:multirootca: Switch to nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1289355 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [14:00:50] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [14:01:33] (03CR) 10Ssingh: [C:03+1] hiera: using haproxy-awslc on cp3074,cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1289998 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [14:02:17] (03CR) 10Alex.sanford: Enforce 2FA requirements for phase 3 groups (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [14:03:12] (03Merged) 10jenkins-bot: Site info should output thumblimits as array [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293710 (https://phabricator.wikimedia.org/T427066) (owner: 10Ladsgroup) [14:04:36] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [14:04:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - 
marostegui@cumin1003" [14:04:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:04:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1013.eqiad.wmnet [14:04:59] !log marostegui@cumin1003 Removing pc1013 from zarcillo T427190 [14:04:59] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.decommission (exit_code=99) [14:05:02] (03PS1) 10Eevans: linked-artifacts: configure staging for topics lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293726 (https://phabricator.wikimedia.org/T414112) [14:05:04] T427190: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190 [14:06:12] (03CR) 10Marostegui: "I've ran the cookbook and it failed at the end with: https://phabricator.wikimedia.org/P93015" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [14:07:01] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1293710|Site info should output thumblimits as array (T427066)]] [14:07:06] T427066: Media dialog in VisualEditor uses invalid imageinfo API parameter - https://phabricator.wikimedia.org/T427066 [14:07:19] (03CR) 10Marostegui: sre.mysql.upgrade: fix looping logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [14:07:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T426633)', diff saved to https://phabricator.wikimedia.org/P93016 and previous config saved to /var/cache/conftool/dbconfig/20260526-140718-fceratto.json [14:07:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:07:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T426633)', diff saved to https://phabricator.wikimedia.org/P93017 and previous config 
saved to /var/cache/conftool/dbconfig/20260526-140748-fceratto.json [14:08:18] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2003.codfw.wmnet,wikikube-worker[2248-2250].codfw.wmnet [14:08:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2003.codfw.wmnet,wikikube-worker[2248-2250].codfw.wmnet [14:08:44] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11955387 (10Marostegui) a:05Marostegui→03None [14:08:50] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11955396 (10Marostegui) Ready for DC-Ops [14:08:54] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1293710|Site info should output thumblimits as array (T427066)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:09:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:09:23] !log restoring lvs2011 as primary (T426199) [14:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:27] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [14:09:33] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [14:10:15] (03CR) 10Bartosz Wójtowicz: [C:03+1] linked-artifacts: configure staging for topics lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293726 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:10:27] !log fabfur@cumin1003 START - Cookbook sre.hosts.remove-downtime for lvs2011.codfw.wmnet [14:10:28] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2011.codfw.wmnet [14:10:30] PROBLEM - pybal on lvs2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:10:30] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:10:31] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1054.eqiad.wmnet with reason: host reimage [14:10:37] (03CR) 10Eevans: [C:03+2] linked-artifacts: configure staging for topics lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293726 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:10:55] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:25] (03CR) 10Mszwarc: Enforce 2FA requirements for phase 3 groups (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [14:11:30] RECOVERY - pybal on lvs2011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:11:30] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:12:28] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [14:13:41] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293710|Site info should output thumblimits as array (T427066)]] (duration: 06m 40s) [14:13:46] T427066: Media dialog in VisualEditor uses invalid imageinfo API parameter - https://phabricator.wikimedia.org/T427066 [14:14:22] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp2043.* [14:14:36] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1054.eqiad.wmnet with reason: host reimage [14:14:45] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2223: switch maintenance [14:14:49] !log repooled cp2043 (T426199) [14:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:53] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [14:15:08] (03CR) 10Reedy: Periodic jobs: add demote_ineligible_users (and _central_ counterpart) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285315 (https://phabricator.wikimedia.org/T425396) (owner: 10Mszwarc) [14:16:04] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [14:16:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2156 (T426633)', diff saved to https://phabricator.wikimedia.org/P93021 and previous config saved to /var/cache/conftool/dbconfig/20260526-141628-fceratto.json [14:16:34] (03PS3) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [14:17:32] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Unrack old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427283 (10Papaul) 03NEW [14:17:56] (03PS1) 10Eevans: linked-artifacts: fix typo-ed configuration param [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293732 (https://phabricator.wikimedia.org/T414112) [14:18:01] (03CR) 10Alex.sanford: Enforce 2FA requirements for phase 3 groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [14:18:19] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11955437 (10Papaul) [14:18:35] !log restarting mediabackups@codfw after maintenance on a codfw backup media storage server T426199 [14:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:45] (03CR) 10Federico Ceratto: "Updated with a fix and functional tests." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [14:18:59] (03PS2) 10Alex.sanford: Enforce 2FA requirements for phase 3 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) [14:19:08] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for backup2015.codfw.wmnet,db2197.codfw.wmnet [14:19:09] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for backup2015.codfw.wmnet,db2197.codfw.wmnet [14:19:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [14:19:53] (03PS4) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [14:20:02] (03CR) 10Alex.sanford: Enforce 2FA requirements for phase 3 groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [14:21:18] (03CR) 10Federico Ceratto: "I left a `TODO check puppet after the merge` in case we want to check if the host is gone from dbctl before calling sre.hosts.decommission" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [14:21:25] (03CR) 10Eevans: [C:03+2] linked-artifacts: fix typo-ed configuration param [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293732 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:22:34] (03CR) 10Mszwarc: [C:03+1] Enforce 2FA requirements for phase 3 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [14:24:29] (03Merged) 10jenkins-bot: linked-artifacts: fix typo-ed configuration param [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293732 
(https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:24:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc1014: Rack maintenance completed [14:24:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [14:24:50] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [14:24:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1014: Rack maintenance completed [14:25:10] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11955479 (10RobH) a:03RobH [14:25:26] (03CR) 10CI reject: [V:04-1] cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [14:25:46] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [14:26:01] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [14:26:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P93023 and previous config saved to /var/cache/conftool/dbconfig/20260526-142636-fceratto.json [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1430) [14:31:23] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1054.eqiad.wmnet with OS trixie [14:36:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P93026 and previous config saved to /var/cache/conftool/dbconfig/20260526-143643-fceratto.json [14:38:51] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: align bastion_hosts puppet type [puppet] - 10https://gerrit.wikimedia.org/r/1291946 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [14:39:41] (03PS4) 
10AOkoth: phabricator: replace phab2002 with phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) [14:40:30] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2167: Migration of db2167.codfw.wmnet completed [14:40:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:41:01] (03CR) 10Jelto: [C:03+1] "lgtm! This change will also install envoy on the wmcs test instance. We might want to set some reasonable values there as well or disable " [puppet] - 10https://gerrit.wikimedia.org/r/1293722 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [14:43:11] (03PS1) 10Eevans: linked-artifacts: add egress rule for inference service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293742 (https://phabricator.wikimedia.org/T414112) [14:43:19] (03CR) 10AOkoth: [C:03+2] phabricator: replace phab2002 with phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [14:43:32] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search,name=codfw [14:44:09] jouncebot: nowandnext [14:44:09] For the next 0 hour(s) and 15 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1430) [14:44:09] In 0 hour(s) and 15 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1500) [14:45:21] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs-scholarly,name=codfw [14:45:32] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly,name=codfw [14:46:00] (03CR) 10Eevans: [C:03+2] linked-artifacts: add egress rule for inference service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293742 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:46:52] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T426633)', diff saved to https://phabricator.wikimedia.org/P93029 and previous config saved to /var/cache/conftool/dbconfig/20260526-144651-fceratto.json [14:47:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [14:47:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T426633)', diff saved to https://phabricator.wikimedia.org/P93030 and previous config saved to /var/cache/conftool/dbconfig/20260526-144718-fceratto.json [14:47:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2029.codfw.wmnet to cluster codfw and group A [14:48:37] (03Merged) 10jenkins-bot: linked-artifacts: add egress rule for inference service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293742 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:49:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2029.codfw.wmnet to cluster codfw and group A [14:49:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A [14:49:36] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [14:49:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2221: Rack maintenance completed [14:49:47] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [14:50:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11955622 (10Jclark-ctr) a:03Jclark-ctr [14:51:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2222: Rack maintenance completed [14:51:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2030.codfw.wmnet to cluster codfw and 
group A [14:51:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11955625 (10Jclark-ctr) pc1013 C5 U26 [14:51:23] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11955627 (10Jclark-ctr) [14:52:02] !log remove ganeti1025 from eqiad Ganeti cluster T424680 [14:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:06] T424680: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680 [14:52:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [14:52:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11955646 (10MoritzMuehlenhoff) [14:53:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11955647 (10ops-monitoring-bot) Draining ganeti1026.eqiad.wmnet of running VMs [14:54:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [14:54:22] PROBLEM - ganeti-noded running on ganeti1025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:54:22] PROBLEM - ganeti-confd running on ganeti1025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:54:50] FIRING: ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:14] !log 
jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [14:55:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11955675 (10ops-monitoring-bot) Draining ganeti1026.eqiad.wmnet of running VMs [14:55:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T426633)', diff saved to https://phabricator.wikimedia.org/P93033 and previous config saved to /var/cache/conftool/dbconfig/20260526-145538-fceratto.json [14:55:43] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2222.codfw.wmnet [14:55:44] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2222.codfw.wmnet [14:55:49] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2221.codfw.wmnet [14:55:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2221.codfw.wmnet [14:56:08] (03PS1) 10Muehlenhoff: Add urldownloader[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/1293743 (https://phabricator.wikimedia.org/T427282) [14:56:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2196: Rack maintenance completed [15:00:00] 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11955698 (10elukey) @brouberol would it be ok to just add bash files in puppet listing the commands to add ACLs for each cluster? Just to have something right now while we work on the tool (that seems more long-term). I'd like t... [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1500). 
[15:00:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2223: switch maintenance [15:00:41] 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11955703 (10brouberol) Yep, sounds fair! [15:01:51] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [15:01:54] !log uploading prometheus-memcached-exporter_0.16.0-1_amd64 on apt1002 [15:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:20] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [15:02:29] !log brennen@deploy1003 Started deploy [phabricator/deployment@939557b]: deploy phab2002 for T427286 [15:02:33] T427286: Deploy Phab/Phorge 2026-05-26 - https://phabricator.wikimedia.org/T427286 [15:03:14] !log brennen@deploy1003 Finished deploy [phabricator/deployment@939557b]: deploy phab2002 for T427286 (duration: 00m 45s) [15:03:39] !log brennen@deploy1003 Started deploy [phabricator/deployment@939557b]: deploy phab1004 for T427286 [15:04:10] (03CR) 10Marostegui: "This wiki doesn't exist yet right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [15:04:18] !log brennen@deploy1003 Finished deploy [phabricator/deployment@939557b]: deploy phab1004 for T427286 (duration: 00m 39s) [15:04:59] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2221: Rack maintenance completed [15:05:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P93037 and previous config saved to /var/cache/conftool/dbconfig/20260526-150546-fceratto.json [15:06:24] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2222: Rack maintenance completed [15:06:47] (03CR) 10Ladsgroup: "yup but it'll be private so I want to set the filter before it's created." [puppet] - 10https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [15:07:16] (03CR) 10JMeybohm: [V:03+2 C:03+2] kube-state-metrics: Update to v2.14.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293673 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [15:07:55] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw [15:10:52] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2196.codfw.wmnet [15:10:53] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2196.codfw.wmnet [15:12:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2196: Rack maintenance completed [15:15:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P93040 and previous config saved to /var/cache/conftool/dbconfig/20260526-151552-fceratto.json [15:18:23] (03CR) 10JMeybohm: [C:03+2] kube-state-metrics: Update to default to v2.14 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1293675 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [15:22:09] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:22:38] (03PS2) 10JavierMonton: image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) [15:22:39] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:22:40] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:23:08] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:23:34] (03PS1) 10Clément Goubert: cache::text: pipe caching for lw streaming API [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) [15:23:52] (03CR) 10JavierMonton: "As discussed on Slack, I changed the approach to use the jre-21 on bookworm. I've checked it and the symlink is working properly in this v" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [15:24:06] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:24:09] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:24:10] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:24:13] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:24:50] RESOLVED: ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:29] (03PS2) 10Blake: Update to kubernetes v1.31.14. 
[debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1293087 (https://phabricator.wikimedia.org/T427065) [15:25:51] (03CR) 10Blake: Update to kubernetes v1.31.14. (031 comment) [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1293087 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [15:26:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T426633)', diff saved to https://phabricator.wikimedia.org/P93041 and previous config saved to /var/cache/conftool/dbconfig/20260526-152559-fceratto.json [15:26:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [15:26:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T426633)', diff saved to https://phabricator.wikimedia.org/P93042 and previous config saved to /var/cache/conftool/dbconfig/20260526-152629-fceratto.json [15:27:03] (03Merged) 10jenkins-bot: kube-state-metrics: Update to default to v2.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293675 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [15:27:09] (03PS1) 10JMeybohm: Remve pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) [15:28:04] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:28:07] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:28:08] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:28:11] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:28:55] (03CR) 10CI reject: [V:04-1] Remve pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) (owner: 10JMeybohm) [15:29:10] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: 
apply [15:29:13] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:29:14] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:29:17] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:30:08] (03PS1) 10JMeybohm: Update kube-state-metrics to v2.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293751 (https://phabricator.wikimedia.org/T388387) [15:30:13] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:30:15] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:30:17] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:30:20] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:31:20] FIRING: ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:45] (03CR) 10JMeybohm: "I'll deploy to staging first but since it should be a small impact change I thought it might not be worth it creating separate version ove" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293751 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [15:33:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T426633)', diff saved to https://phabricator.wikimedia.org/P93043 and previous config saved to /var/cache/conftool/dbconfig/20260526-153357-fceratto.json [15:34:35] (03PS2) 10JMeybohm: Remve pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) [15:41:33] (03CR) 10CI reject: [V:04-1] Remve pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 
(https://phabricator.wikimedia.org/T423251) (owner: 10JMeybohm) [15:44:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P93044 and previous config saved to /var/cache/conftool/dbconfig/20260526-154405-fceratto.json [15:44:40] jouncebot: now [15:44:40] For the next 0 hour(s) and 15 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1500) [15:44:50] (03PS12) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [15:44:50] (03PS3) 10Kamila Součková: Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) [15:46:28] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [15:46:46] (03PS1) 10Kamila Součková: CI: Fix race condition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) [15:47:52] (03PS3) 10JMeybohm: Remove pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) [15:48:42] 10SRE-tools, 10Ceph, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Enhacements to wmcs.ceph.roll_reboot_osds - https://phabricator.wikimedia.org/T427295 (10Andrew) 03NEW [15:50:27] (03PS11) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [15:51:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266#11955954 (10Jhancock.wm) the server is under warranty AND actually has idrac readout. 
I've replaced the drive but i have a replacement coming from Dell to replace the one I used fro... [15:51:20] RESOLVED: ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:23] (03CR) 10JMeybohm: [C:03+2] Update kube-state-metrics to v2.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293751 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [15:54:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P93045 and previous config saved to /var/cache/conftool/dbconfig/20260526-155413-fceratto.json [15:54:23] (03CR) 10Ilias Sarantopoulos: cache::text: pipe caching for lw streaming API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [15:54:42] (03CR) 10CI reject: [V:04-1] Remove pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) (owner: 10JMeybohm) [15:55:22] !log aokoth@deploy1003 Started deploy [phabricator/deployment@939557b]: deploy phab2003 - T423727 [15:55:27] T423727: replace phab2002 with phab2003 - https://phabricator.wikimedia.org/T423727 [15:55:44] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:55:45] !log aokoth@deploy1003 Finished deploy [phabricator/deployment@939557b]: deploy phab2003 - T423727 (duration: 00m 22s) [15:55:47] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:55:48] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:55:51] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply 
[15:56:06] (03CR) 10Ilias Sarantopoulos: cache::text: pipe caching for lw streaming API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [15:57:20] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:57:23] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:57:24] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [15:57:27] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [15:59:55] (03PS4) 10JMeybohm: Remove pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) [16:00:04] jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1600). nyaa~ [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:12] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:00:15] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:00:16] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:00:19] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:00:41] (03Merged) 10jenkins-bot: Update kube-state-metrics to v2.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293751 (https://phabricator.wikimedia.org/T388387) (owner: 10JMeybohm) [16:02:50] !log aokoth@deploy1003 Started deploy [phabricator/deployment@939557b]: deploy phab2003 - T423727 [16:02:55] T423727: replace phab2002 with phab2003 - https://phabricator.wikimedia.org/T423727 [16:03:18] !log aokoth@deploy1003 Finished deploy [phabricator/deployment@939557b]: deploy phab2003 - T423727 (duration: 00m 28s) [16:03:51] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:03:54] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:03:55] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:03:58] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:04:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T426633)', diff saved to https://phabricator.wikimedia.org/P93046 and previous config saved to /var/cache/conftool/dbconfig/20260526-160420-fceratto.json [16:04:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance [16:04:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T426633)', diff saved to https://phabricator.wikimedia.org/P93047 and previous config saved to /var/cache/conftool/dbconfig/20260526-160450-fceratto.json [16:06:16] !log jgiannelos@deploy1003 
helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:06:19] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:06:20] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:06:23] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:07:01] (03CR) 10Kamila Součková: "That's what I get for the `sed -i` '^^ Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [16:07:43] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad [16:08:11] (03CR) 10CI reject: [V:04-1] Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [16:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:20] (03PS2) 10Clément Goubert: cache::text: pipe caching for lw streaming API [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) [16:10:22] (03CR) 10Clément Goubert: cache::text: pipe caching for lw streaming API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [16:10:56] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:10:58] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:10:59] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:11:03] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:11:06] (03PS2) 10Daniel 
Kinzler: rest-gateway: tighten rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) [16:11:19] (03CR) 10Daniel Kinzler: rest-gateway: tighten rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) (owner: 10Daniel Kinzler) [16:12:42] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [16:13:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T426633)', diff saved to https://phabricator.wikimedia.org/P93049 and previous config saved to /var/cache/conftool/dbconfig/20260526-161328-fceratto.json [16:13:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2089.codfw.wmnet [16:13:57] (03CR) 10Ssingh: "@tcipriani@wikimedia.org: This will need a +1 from you please." [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [16:17:17] (03CR) 10CI reject: [V:04-1] CI: Fix race condition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [16:17:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:21:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266#11956168 (10MatthewVernon) @Jhancock.wm this is an odd one, but - can you pull the drive and check it doesn't have writes disabled somehow, please? I've tried rebooting, but still:... 
[16:23:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P93050 and previous config saved to /var/cache/conftool/dbconfig/20260526-162336-fceratto.json [16:23:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266#11956174 (10MatthewVernon) Oh, actually, it's showing up as connected by SAS not SATA, too: ` 64:11 26 Onln - 7.277 TB SAS HDD N N 512B ST8000NM024B U JBOD ` c... [16:24:14] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:30:47] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:30:49] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:30:51] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:30:54] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:31:08] (03CR) 10FNegri: sre.mysql.upgrade: fix looping logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [16:33:41] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw and A:cp [16:33:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P93051 and previous config saved to /var/cache/conftool/dbconfig/20260526-163344-fceratto.json [16:33:58] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw and A:cp [16:34:14] RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:19] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:34:22] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:34:23] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:34:26] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:34:56] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:35:18] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:36:00] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:36:02] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:36:03] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:36:07] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:36:41] (03CR) 10Kamila Součková: "I promise it passes except for the thing that's broken in the main branch rn :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [16:36:42] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:37:00] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:37:15] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[16:37:31] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:37:33] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:37:34] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:37:36] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:37:38] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:37:52] (03CR) 10Kamila Součková: [C:03+2] mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [16:38:35] (03CR) 10Kamila Součková: [C:03+2] "Done, good point. I promise it passes now, except somebody just merged something bad into upstream :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [16:39:11] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:39:14] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:39:15] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:39:18] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:40:14] !log reboot lvs 101[345].eqiad.wmnet [16:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:19] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:40:22] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:40:22] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [16:40:23] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:40:26] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:40:55] FIRING: 
[2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:41:01] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:20] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2044.codfw.wmnet [16:42:29] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:42:42] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2043.codfw.wmnet [16:43:39] (03PS4) 10Andrew Bogott: designate: remove leftover mcrouter code [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) [16:43:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T426633)', diff saved to https://phabricator.wikimedia.org/P93052 and previous config saved to /var/cache/conftool/dbconfig/20260526-164352-fceratto.json [16:44:07] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11956308 (10ayounsi) [16:44:12] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-wdqs1002 to eqiad - jclark@cumin1003" [16:44:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance [16:44:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-wdqs1002 to eqiad - jclark@cumin1003" [16:44:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:44:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T426633)', diff 
saved to https://phabricator.wikimedia.org/P93053 and previous config saved to /var/cache/conftool/dbconfig/20260526-164421-fceratto.json [16:44:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11956312 (10ayounsi) [16:44:57] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:44:59] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:45:00] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:45:04] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:45:32] (03CR) 10Atsuko: [C:03+2] image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [16:45:43] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:45:44] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:45:54] (03CR) 10Atsuko: [C:03+1] image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [16:45:55] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:02] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [16:47:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - 
https://phabricator.wikimedia.org/T427266#11956349 (10Jhancock.wm) that is weird. i'm gonna go check it out. might take a few. [16:50:47] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [16:50:50] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [16:50:51] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [16:50:54] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [16:52:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T426633)', diff saved to https://phabricator.wikimedia.org/P93054 and previous config saved to /var/cache/conftool/dbconfig/20260526-165240-fceratto.json [16:52:42] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:52:53] (03CR) 10Blake: [C:03+1] Remove pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) (owner: 10JMeybohm) [16:54:40] (03PS12) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [16:55:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1038.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:57:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:00:03] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1700) [17:00:06] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:00:07] !log jgiannelos@deploy1003 helmfile [codfw] START 
helmfile.d/services/mw-parsoid: apply [17:00:11] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:01:31] (03PS1) 10ArielGlenn: add new members of mw release working group to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T423255) [17:01:54] (03PS1) 10Kamila Součková: device-analytics: fix indentation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293770 (https://phabricator.wikimedia.org/T425310) [17:02:27] (03CR) 10CI reject: [V:04-1] add new members of mw release working group to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T423255) (owner: 10ArielGlenn) [17:02:44] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:02:47] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:02:48] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:02:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P93055 and previous config saved to /var/cache/conftool/dbconfig/20260526-170247-fceratto.json [17:02:51] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:04:05] (03PS2) 10ArielGlenn: add new members of mw release working group to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T423255) [17:04:13] (03Merged) 10jenkins-bot: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [17:04:14] (03Abandoned) 10Bking: relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [17:04:23] (03CR) 
10Clément Goubert: [C:03+1] device-analytics: fix indentation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293770 (https://phabricator.wikimedia.org/T425310) (owner: 10Kamila Součková) [17:05:13] jouncebot: nowandnext [17:05:13] For the next 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1700) [17:05:13] In 2 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T2000) [17:05:14] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:05:17] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:05:17] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266#11956416 (10Jhancock.wm) @MatthewVernon that was my bad. i didn't double check the disk before i installed. just assumed it was the right kind cause it was in the drawer i kept them... 
[17:05:18] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:05:21] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:06:34] (03CR) 10Kamila Součková: [C:03+2] device-analytics: fix indentation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293770 (https://phabricator.wikimedia.org/T425310) (owner: 10Kamila Součková) [17:07:50] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1038.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:08:38] (03Merged) 10jenkins-bot: device-analytics: fix indentation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293770 (https://phabricator.wikimedia.org/T425310) (owner: 10Kamila Součková) [17:09:40] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:09:43] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:09:44] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:09:47] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:11:12] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:11:15] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:11:16] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:11:19] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:12:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P93056 and previous config saved to /var/cache/conftool/dbconfig/20260526-171255-fceratto.json [17:13:06] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:13:09] !log jgiannelos@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-parsoid: apply [17:13:10] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:13:13] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:13:29] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1038.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:14:31] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:14:34] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:14:35] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:14:38] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:15:14] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: T426585 - bking@cumin2002 [17:16:42] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:16:45] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:16:46] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:16:49] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:16:59] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:17:01] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:17:03] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:17:06] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:17:21] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:17:24] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:17:25] !log arlolra@deploy1003 helmfile [codfw] START 
helmfile.d/services/mw-parsoid: apply [17:17:28] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:18:49] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1038.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:20:18] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2046.codfw.wmnet [17:20:59] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [17:21:42] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2045.codfw.wmnet [17:22:29] (03PS4) 10Kamila Součková: Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) [17:22:33] (03PS2) 10Kamila Součková: CI: Fix race condition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) [17:23:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T426633)', diff saved to https://phabricator.wikimedia.org/P93057 and previous config saved to /var/cache/conftool/dbconfig/20260526-172303-fceratto.json [17:23:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance [17:23:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T426633)', diff saved to https://phabricator.wikimedia.org/P93058 and previous config saved to /var/cache/conftool/dbconfig/20260526-172332-fceratto.json [17:24:51] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-wdqs1001 to eqiad - jclark@cumin1003" [17:24:54] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:24:55] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: adding dse-k8s-wdqs1001 to eqiad - jclark@cumin1003" [17:24:55] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:24:57] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:24:58] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:25:01] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:25:42] (03PS13) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [17:26:08] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:27:14] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:27:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:27:29] (03PS1) 10Dreamy Jazz: Enable hCaptcha for VisualEditor and MobileFrontend for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293779 (https://phabricator.wikimedia.org/T425940) [17:27:51] (03PS2) 10Dreamy Jazz: Enable hCaptcha for VisualEditor and MobileFrontend for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293779 (https://phabricator.wikimedia.org/T425940) [17:28:36] jouncebot: nowandnext [17:28:36] For the next 0 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T1700) [17:28:36] In 2 hour(s) and 31 minute(s): UTC late backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T2000) [17:28:44] Any objection for me using scap during this window? [17:31:06] (03CR) 10Kamila Součková: "This seems to slow the tests down enough that they may time out. I think slower but not flaky is still an improvement, but we may need to " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [17:31:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T426633)', diff saved to https://phabricator.wikimedia.org/P93059 and previous config saved to /var/cache/conftool/dbconfig/20260526-173109-fceratto.json [17:31:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293779 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [17:32:25] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:32:28] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:32:30] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:32:31] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:32:35] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:33:20] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:33:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [17:33:23] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:33:24] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:33:27] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:34:22] (03Merged) 10jenkins-bot: Enable hCaptcha for VisualEditor and MobileFrontend for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293779 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [17:34:46] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1293779|Enable hCaptcha for VisualEditor and MobileFrontend for group0 (T425940)]] [17:34:50] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [17:35:13] (03PS14) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [17:35:59] (03CR) 10Ssingh: [C:03+1] cache::text: pipe caching for lw streaming API [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [17:36:23] (03CR) 10Ssingh: [C:03+1] "(caching rule looks good; I haven't verified the path)" [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [17:36:30] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:36:47] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1293779|Enable hCaptcha for VisualEditor and MobileFrontend for group0 (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:36:55] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:36:56] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:37:22] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:37:25] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:56] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [17:38:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [17:39:29] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2089.codfw.wmnet [17:41:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P93060 and previous config saved to /var/cache/conftool/dbconfig/20260526-174117-fceratto.json [17:42:08] (03CR) 10MSantos: [C:03+1] "Approved." 
[puppet] - 10https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T423255) (owner: 10ArielGlenn) [17:42:11] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293779|Enable hCaptcha for VisualEditor and MobileFrontend for group0 (T425940)]] (duration: 07m 25s) [17:42:15] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [17:42:25] FIRING: [15x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:25] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426503#11956654 (10Jclark-ctr) 05Open→03Resolved [17:45:37] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:46:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:48:19] (03PS15) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [17:51:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P93062 and previous config saved to /var/cache/conftool/dbconfig/20260526-175124-fceratto.json [17:51:38] (03CR) 10Dzahn: [C:03+1] lvs7003: add gitlab-ssh and gitlab-https [puppet] - 
10https://gerrit.wikimedia.org/r/1291898 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [17:51:40] (03CR) 10Eric Gardner: "I may re-work this to land on testwiki first (to backport later today) and then I can post a follow-up patch to enable on Wikipedia proper" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [17:52:08] (03CR) 10FNegri: "@fceratto@wikimedia.org I added a new test for the clouddb scenario, this is ready for another round of review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [17:52:25] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1079:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:53:20] (03CR) 10Dzahn: [C:03+1] "agree with Jelto, lgtm but not 100% sure if the cert part will work in cloud or we even want the envoy there. 
ideally we do though, so let" [puppet] - 10https://gerrit.wikimedia.org/r/1293722 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [17:54:15] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:54:18] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:54:19] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:54:23] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:55:37] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:55:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:57:25] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:03] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2048.codfw.wmnet [18:00:05] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch1068 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:00:05] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1073 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:00:43] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2047.codfw.wmnet [18:01:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T426633)', diff saved to https://phabricator.wikimedia.org/P93063 and previous 
config saved to /var/cache/conftool/dbconfig/20260526-180132-fceratto.json [18:01:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [18:02:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T426633)', diff saved to https://phabricator.wikimedia.org/P93064 and previous config saved to /var/cache/conftool/dbconfig/20260526-180205-fceratto.json [18:02:29] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:05:11] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch1098 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:05:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266#11956713 (10MatthewVernon) @Jhancock.wm new-new drive looks good now, thanks :) [18:07:06] (03CR) 10Ssingh: "Hi. Thanks for waiting. We discussed this in the Traffic meeting today, so following up on that." [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [18:07:23] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11956719 (10wiki_willy) In regards to buying a new CPU - we don't have any more budget available for FY25-26, but I'm ok with going over budget if this is the best route forward. We'll have four addi... 
[18:07:25] FIRING: [19x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:09:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T426633)', diff saved to https://phabricator.wikimedia.org/P93065 and previous config saved to /var/cache/conftool/dbconfig/20260526-180915-fceratto.json [18:10:05] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch1068 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:10:05] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1073 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:10:13] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1100 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:12:03] (03CR) 10Dzahn: "The IPs do not resolve yet in DNS - that's because they are still in state "reserved" in netbox. Should I activate them now? 
hmmm, trying " [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [18:12:25] FIRING: [16x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:11] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [18:15:11] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch1098 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:15:13] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [18:15:14] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [18:15:18] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [18:17:25] FIRING: [21x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P93066 and previous config saved to /var/cache/conftool/dbconfig/20260526-181923-fceratto.json [18:20:05] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1074 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:07] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1075 is CRITICAL: CRITICAL: Status of the systemd unit 
push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:13] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1100 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:58] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [18:21:01] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [18:21:02] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [18:21:05] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [18:24:34] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:25:14] (03PS5) 10Andrew Bogott: test_cookbook.py: Allow recording tests on invoked cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/1290858 [18:27:09] (03PS1) 10Scott French: aptrepo: add component/php83 to bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1293789 (https://phabricator.wikimedia.org/T427312) [18:27:11] (03PS1) 10Scott French: package_builder: Use DIST in the D04php hook [puppet] - 10https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) [18:27:25] FIRING: [22x] SystemdUnitFailed: push_cross_cluster_settings_9600.service on cirrussearch1073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:07] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: activate_gitlab-lb_magru-v4 - dzahn@cumin2002" [18:29:12] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: activate_gitlab-lb_magru-v4 - 
dzahn@cumin2002" [18:29:12] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P93068 and previous config saved to /var/cache/conftool/dbconfig/20260526-182931-fceratto.json [18:30:04] (03PS1) 10Bking: cirrussearch: return relforge to its previous state [puppet] - 10https://gerrit.wikimedia.org/r/1293791 (https://phabricator.wikimedia.org/T427306) [18:30:05] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1074 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:30:07] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1075 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:30:15] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1118 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:30:31] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: T426585 - bking@cumin2002 [18:32:30] FIRING: [30x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:37:47] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2050.codfw.wmnet [18:38:48] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - 
https://phabricator.wikimedia.org/T418922#11956827 (10Jhancock.wm) [18:39:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T426633)', diff saved to https://phabricator.wikimedia.org/P93069 and previous config saved to /var/cache/conftool/dbconfig/20260526-183939-fceratto.json [18:40:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [18:40:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93070 and previous config saved to /var/cache/conftool/dbconfig/20260526-184009-fceratto.json [18:40:15] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1118 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:40:55] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:41:30] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2049.codfw.wmnet [18:41:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb2014.codfw.wmnet with OS trixie [18:41:40] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11956829 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb2014.codfw.wmnet with OS trixie execu... 
[18:43:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb2013.codfw.wmnet with OS trixie [18:43:39] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11956849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb2013.codfw.wmnet with OS trixie [18:44:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb2014.codfw.wmnet with OS trixie [18:44:11] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11956852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb2014.codfw.wmnet with OS trixie [18:47:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93071 and previous config saved to /var/cache/conftool/dbconfig/20260526-184724-fceratto.json [18:48:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293791 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [18:52:04] (03CR) 10Muehlenhoff: package_builder: Use DIST in the D04php hook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) (owner: 10Scott French) [18:54:28] (03PS1) 10AOkoth: migration: add missing config file [puppet] - 10https://gerrit.wikimedia.org/r/1293793 (https://phabricator.wikimedia.org/T423727) [18:55:03] (03CR) 10CI reject: [V:04-1] migration: add missing config file [puppet] - 10https://gerrit.wikimedia.org/r/1293793 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [18:55:10] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293791 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) 
[18:55:45] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:55:45] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:56:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2013.codfw.wmnet with reason: host reimage [18:56:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2014.codfw.wmnet with reason: host reimage [18:57:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P93072 and previous config saved to /var/cache/conftool/dbconfig/20260526-185732-fceratto.json [18:57:45] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:59:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [19:03:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2013.codfw.wmnet with reason: host reimage [19:07:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2014.codfw.wmnet with reason: host reimage [19:07:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P93073 and previous config saved to /var/cache/conftool/dbconfig/20260526-190740-fceratto.json [19:08:06] (03PS2) 10Bking: cirrussearch: return relforge to its previous state [puppet] - 10https://gerrit.wikimedia.org/r/1293791 (https://phabricator.wikimedia.org/T427306) [19:08:10] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293791 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [19:10:20] (03CR) 
10Scott French: package_builder: Use DIST in the D04php hook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) (owner: 10Scott French) [19:14:21] (03PS1) 10Aude: Re-enable ReadingLists survey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293794 (https://phabricator.wikimedia.org/T426781) [19:16:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293794 (https://phabricator.wikimedia.org/T426781) (owner: 10Aude) [19:16:35] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2052.codfw.wmnet [19:16:44] (03PS2) 10Aude: Re-enable ReadingLists QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290926 (https://phabricator.wikimedia.org/T426781) [19:17:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T426633)', diff saved to https://phabricator.wikimedia.org/P93074 and previous config saved to /var/cache/conftool/dbconfig/20260526-191748-fceratto.json [19:18:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [19:18:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T426633)', diff saved to https://phabricator.wikimedia.org/P93075 and previous config saved to /var/cache/conftool/dbconfig/20260526-191818-fceratto.json [19:19:19] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum5003.eqsin.wmnet with OS trixie [19:19:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:20:31] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2051.codfw.wmnet [19:21:24] 
!log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:21:46] 10ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11956901 (10RobH) 05Open→03Resolved [19:22:54] jhancock@cumin2002 reimage (PID 1006998) is awaiting input [19:24:56] brett@cumin2002 reimage (PID 1033467) is awaiting input [19:24:57] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:25:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T426633)', diff saved to https://phabricator.wikimedia.org/P93076 and previous config saved to /var/cache/conftool/dbconfig/20260526-192533-fceratto.json [19:27:04] dzahn@cumin2002 netbox (PID 1034828) is awaiting input [19:28:02] jhancock@cumin2002 reimage (PID 1007346) is awaiting input [19:28:13] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [19:29:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:29:27] (03PS2) 10Scott French: package_builder: Use @distribution in the D04php hook [puppet] - 10https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) [19:29:50] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) (owner: 10Scott French) [19:30:08] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: activate_gitlab-lb_IPs - dzahn@cumin2002" [19:33:13] dzahn@cumin2002 netbox (PID 1034828) is awaiting input 
[19:34:14] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:27] 10SRE-tools, 10Ceph, 06cloud-services-team, 10Cloud-VPS, and 2 others: Enhacements to wmcs.ceph.roll_reboot_osds - https://phabricator.wikimedia.org/T427295#11956936 (10Andrew) Part 1 would involve a fair bit of refactoring since we currently use 'ceph node' calls to enumerate osd nodes rather than cumin. [19:35:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: activate_gitlab-lb_IPs - dzahn@cumin2002" [19:35:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:35:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P93077 and previous config saved to /var/cache/conftool/dbconfig/20260526-193541-fceratto.json [19:37:04] (03Abandoned) 10AOkoth: migration: add missing config file [puppet] - 10https://gerrit.wikimedia.org/r/1293793 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [19:38:17] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum5003.eqsin.wmnet with OS trixie [19:39:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum5003.eqsin.wmnet with OS trixie [19:40:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:40:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2013.codfw.wmnet with OS trixie [19:40:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:40:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2014.codfw.wmnet with OS trixie [19:40:54] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11956971 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb2013.codfw.wmnet with OS trixie compl... [19:40:59] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11956972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb2014.codfw.wmnet with OS trixie compl... [19:41:17] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11956975 (10Jhancock.wm) 05Open→03Resolved [19:42:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 31860528 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:43:00] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2028 [19:43:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2029 [19:43:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2028 [19:43:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2029 [19:43:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2028.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:44:13] !log jhancock@cumin2002 START - 
Cookbook sre.hosts.provision for host wdqs2029.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:44:13] brett@cumin2002 reimage (PID 1047234) is awaiting input [19:44:19] (03PS1) 10BCornwall: ncmonitor: Ignore "Beat Wikipedia" domains [puppet] - 10https://gerrit.wikimedia.org/r/1293796 [19:45:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 53096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:45:23] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum5003.eqsin.wmnet with OS trixie [19:45:46] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) (owner: 10Scott French) [19:45:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P93078 and previous config saved to /var/cache/conftool/dbconfig/20260526-194549-fceratto.json [19:47:25] (03CR) 10BCornwall: "I emailed legal and confirmed that these were defensively registered and intended to just be parked. 
I've created a follow-up CR (I2fa17a9" [puppet] - 10https://gerrit.wikimedia.org/r/1290097 (owner: 10Ncmonitor) [19:47:37] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1290097 (owner: 10Ncmonitor) [19:47:44] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1290098 (owner: 10Ncmonitor) [19:47:53] (03Abandoned) 10BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1290096 (owner: 10Ncmonitor) [19:49:18] (03CR) 10BCornwall: [C:03+2] Remove cp2041/cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/1290006 (https://phabricator.wikimedia.org/T426828) (owner: 10BCornwall) [19:51:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2028.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:51:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2029.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:55:22] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2054.codfw.wmnet [19:55:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T426633)', diff saved to https://phabricator.wikimedia.org/P93079 and previous config saved to /var/cache/conftool/dbconfig/20260526-195557-fceratto.json [19:56:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [19:56:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T426633)', diff saved to https://phabricator.wikimedia.org/P93080 and previous config saved to /var/cache/conftool/dbconfig/20260526-195632-fceratto.json [19:57:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2028.codfw.wmnet with OS trixie [19:57:53] 10ops-codfw, 06SRE, 06DC-Ops, 
06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#11957040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wdqs2028.codf... [19:58:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2029.codfw.wmnet with OS trixie [19:58:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#11957047 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wdqs2029.codf... [19:59:18] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2053.codfw.wmnet [20:00:04] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T2000). [20:00:05] alexsanford and aude: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] hi [20:02:30] hey [20:03:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T426633)', diff saved to https://phabricator.wikimedia.org/P93081 and previous config saved to /var/cache/conftool/dbconfig/20260526-200333-fceratto.json [20:05:27] aude - I can go ahead with my config change, unless you would like to do yours first? [20:05:43] mine can be bundled. it is for the beta cluster only [20:05:52] or i can do separately afterwards [20:06:25] Ok, shall I just do all three together? [20:07:06] Oh wait, I was looking at the wrong thing. So we each just have one. Want me to do them both? 
[20:07:08] I only have one change and moved 2 to tomorrow [20:07:17] yes that would be great thanks [20:07:24] Ok will do [20:07:42] We noticed that the survey was disabled on the beta cluster so want it enabled there first [20:07:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [20:07:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293794 (https://phabricator.wikimedia.org/T426781) (owner: 10Aude) [20:08:50] (03Merged) 10jenkins-bot: Enforce 2FA requirements for phase 3 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [20:08:53] (03Merged) 10jenkins-bot: Re-enable ReadingLists survey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293794 (https://phabricator.wikimedia.org/T426781) (owner: 10Aude) [20:09:19] !log alexsanford@deploy1003 Started scap sync-world: Backport for [[gerrit:1293161|Enforce 2FA requirements for phase 3 groups (T423120)]], [[gerrit:1293794|Re-enable ReadingLists survey on beta cluster (T426781)]] [20:09:25] T423120: FY25-26 Q4: Phase 3 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423120 [20:09:26] T426781: Re-enable ReadingLists QuickSurvey - https://phabricator.wikimedia.org/T426781 [20:11:15] !log alexsanford@deploy1003 alexsanford, aude: Backport for [[gerrit:1293161|Enforce 2FA requirements for phase 3 groups (T423120)]], [[gerrit:1293794|Re-enable ReadingLists survey on beta cluster (T426781)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:11:39] nothing to verify for mine (until the change makes its way to the beta cluster) [20:12:59] sounds good [20:13:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P93082 and previous config saved to /var/cache/conftool/dbconfig/20260526-201341-fceratto.json [20:14:22] !log alexsanford@deploy1003 alexsanford, aude: Continuing with deployment [20:14:32] (03PS2) 10Eric Gardner: MultimediaViewer: enable image carousel as a beta feature on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [20:14:40] (03CR) 10Eric Gardner: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [20:18:34] !log alexsanford@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293161|Enforce 2FA requirements for phase 3 groups (T423120)]], [[gerrit:1293794|Re-enable ReadingLists survey on beta cluster (T426781)]] (duration: 09m 14s) [20:18:40] T423120: FY25-26 Q4: Phase 3 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423120 [20:18:40] T426781: Re-enable ReadingLists QuickSurvey - https://phabricator.wikimedia.org/T426781 [20:19:00] done! [20:19:03] thank you! 
[20:23:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P93083 and previous config saved to /var/cache/conftool/dbconfig/20260526-202349-fceratto.json [20:29:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:31:59] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [20:32:24] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [20:32:25] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [20:32:51] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [20:33:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T426633)', diff saved to https://phabricator.wikimedia.org/P93084 and previous config saved to /var/cache/conftool/dbconfig/20260526-203357-fceratto.json [20:34:14] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:34:21] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2056.codfw.wmnet [20:34:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2221.codfw.wmnet with reason: Maintenance [20:34:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T426633)', diff saved to https://phabricator.wikimedia.org/P93085 and previous config saved to /var/cache/conftool/dbconfig/20260526-203430-fceratto.json [20:37:33] (03PS3) 10Eric Gardner: MultimediaViewer: 
enable image carousel as a beta feature on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [20:38:09] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2055.codfw.wmnet [20:41:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T426633)', diff saved to https://phabricator.wikimedia.org/P93086 and previous config saved to /var/cache/conftool/dbconfig/20260526-204143-fceratto.json [20:42:41] Hi all. I'm planning to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1293701 during the readers deployment window at 21:00 UTC (may be more like 21:30). This change enables a new beta feature on test wiki only. Just a heads up [20:45:32] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:48:46] (03CR) 10Dzahn: [C:03+2] "I switched the IPs from reserved to active state in netbox and ran the netbox sync. Meaning now they actually exist in DNS." [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [20:50:12] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: activate_gitlab-lb_IPs - dzahn@cumin2002" [20:50:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: activate_gitlab-lb_IPs - dzahn@cumin2002" [20:50:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:48] (03CR) 10Dzahn: "I switched the IPs from reserved to active state in netbox and ran the netbox sync. Meaning now they actually exist in DNS." 
[puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [20:51:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P93087 and previous config saved to /var/cache/conftool/dbconfig/20260526-205152-fceratto.json [20:55:39] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on phab2003.codfw.wmnet with reason: WIP [20:56:03] FIRING: ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260526T2100) [21:01:03] RESOLVED: ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P93088 and previous config saved to /var/cache/conftool/dbconfig/20260526-210159-fceratto.json [21:05:52] (03PS2) 10BCornwall: ncmonitor: Ignore "Beat Wikipedia" domains [puppet] - 10https://gerrit.wikimedia.org/r/1293796 [21:06:32] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum5003.eqsin.wmnet with OS trixie [21:06:32] (03CR) 10Ssingh: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1293796 (owner: 10BCornwall) [21:08:15] (03CR) 10BCornwall: [C:03+2] ncmonitor: Ignore "Beat Wikipedia" domains [puppet] - 10https://gerrit.wikimedia.org/r/1293796 (owner: 10BCornwall) [21:12:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T426633)', diff saved to https://phabricator.wikimedia.org/P93089 and previous config saved to /var/cache/conftool/dbconfig/20260526-211207-fceratto.json [21:12:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2222.codfw.wmnet with reason: Maintenance [21:12:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T426633)', diff saved to https://phabricator.wikimedia.org/P93090 and previous config saved to /var/cache/conftool/dbconfig/20260526-211238-fceratto.json [21:14:49] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2058.codfw.wmnet [21:14:50] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw and A:cp [21:15:30] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2057.codfw.wmnet [21:15:31] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_codfw and A:cp [21:19:04] !log dmarc ingress test run mx-in1001 [21:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T426633)', diff saved to https://phabricator.wikimedia.org/P93091 and previous config saved to /var/cache/conftool/dbconfig/20260526-211948-fceratto.json [21:23:00] (03CR) 10Bking: [C:03+2] cirrussearch: return relforge to its previous state [puppet] - 10https://gerrit.wikimedia.org/r/1293791 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [21:27:45] !log Running `/usr/local/bin/foreachwikiindblist "all.dblist - 
mediamoderation-continuous-scan.dblist - preinstall.dblist" extensions/MediaModeration/maintenance/scanFilesInScanTable.php --use-jobqueue --sleep=1 --poll-sleep=10 --verbose` in tmux session - T421688 [21:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:50] T421688: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688 [21:29:05] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS trixie [21:29:37] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host relforge1008 [21:29:48] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:29:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P93092 and previous config saved to /var/cache/conftool/dbconfig/20260526-212955-fceratto.json [21:30:40] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS trixie [21:31:11] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host relforge1009 [21:31:45] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1010.eqiad.wmnet with OS trixie [21:32:09] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host relforge1010 [21:32:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host relforge1010 [21:32:43] (03PS1) 10Arlolra: Deploy PRV to 5 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293805 (https://phabricator.wikimedia.org/T427331) [21:34:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:35:12] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:35:53] 10SRE-tools, 
10Ceph, 06cloud-services-team, 10Cloud-VPS, and 2 others: Enhancements to wmcs.ceph.roll_reboot_osds - https://phabricator.wikimedia.org/T427295#11957363 (10Aklapper) [21:36:30] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host relforge1008 - bking@cumin2002" [21:36:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host relforge1008 - bking@cumin2002" [21:36:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:36:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache relforge1008.eqiad.wmnet 100.32.64.10.in-addr.arpa 0.0.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:36:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) relforge1008.eqiad.wmnet 100.32.64.10.in-addr.arpa 0.0.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:36:41] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1008 [21:37:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11957365 (10Dzahn) @AnnieKim_WMDE WMDE uses firstname.lastname@ but WMF uses RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:40:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P93093 and previous config saved to 
/var/cache/conftool/dbconfig/20260526-214003-fceratto.json [21:40:47] bking@cumin2002 reimage (PID 1119063) is awaiting input [21:40:59] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1008 [21:40:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host relforge1008 [21:42:53] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host relforge1009 - bking@cumin2002" [21:42:56] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [21:42:59] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host relforge1009 - bking@cumin2002" [21:42:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:42:59] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache relforge1009.eqiad.wmnet 120.48.64.10.in-addr.arpa 0.2.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:43:03] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) relforge1009.eqiad.wmnet 120.48.64.10.in-addr.arpa 0.2.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:43:04] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1009 [21:43:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11957375 (10Dwisehaupt) a:05Jgreen→03Dwisehaupt [21:44:25] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1009 [21:44:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host relforge1009 [21:45:53] 
10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T418136#11957394 (10RobH) 05Open→03Declined [21:47:28] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp6015.drmrs.wmnet [21:49:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [21:50:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T426633)', diff saved to https://phabricator.wikimedia.org/P93094 and previous config saved to /var/cache/conftool/dbconfig/20260526-215011-fceratto.json [21:50:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [21:50:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T426633)', diff saved to https://phabricator.wikimedia.org/P93095 and previous config saved to /var/cache/conftool/dbconfig/20260526-215043-fceratto.json [21:51:39] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1008.eqiad.wmnet with reason: host reimage [21:53:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by egardner@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [21:54:47] (03Merged) 10jenkins-bot: MultimediaViewer: enable image carousel as a beta feature on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293701 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [21:54:56] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1009.eqiad.wmnet with reason: host reimage [21:55:12] !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1293701|MultimediaViewer: enable image carousel as a beta feature on testwiki 
(T426799)]] [21:55:16] T426799: [Image Browsing] Launch image carousel as beta feature - https://phabricator.wikimedia.org/T426799 [21:56:08] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp6015.drmrs.wmnet [21:56:20] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1010.eqiad.wmnet with OS trixie [21:56:29] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp6015.drmrs.wmnet [21:57:09] !log egardner@deploy1003 egardner, mfossati: Backport for [[gerrit:1293701|MultimediaViewer: enable image carousel as a beta feature on testwiki (T426799)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:58:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T426633)', diff saved to https://phabricator.wikimedia.org/P93096 and previous config saved to /var/cache/conftool/dbconfig/20260526-215803-fceratto.json [21:59:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1008.eqiad.wmnet with reason: host reimage [22:00:24] !log egardner@deploy1003 egardner, mfossati: Continuing with deployment [22:01:02] robh@cumin2002 upgrade-firmware (PID 1136672) is awaiting input [22:02:43] bking@cumin2002 reimage (PID 1118115) is awaiting input [22:03:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1009.eqiad.wmnet with reason: host reimage [22:04:43] !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293701|MultimediaViewer: enable image carousel as a beta feature on testwiki (T426799)]] (duration: 09m 30s) [22:04:47] T426799: [Image Browsing] Launch image carousel as beta feature - https://phabricator.wikimedia.org/T426799 [22:05:11] That's it for reader deploys for today [22:05:38] sukhe@cumin1003 reimage (PID 
916054) is awaiting input [22:08:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P93097 and previous config saved to /var/cache/conftool/dbconfig/20260526-220811-fceratto.json [22:08:13] (03PS1) 10Bking: relforge: remove logstash (gelf) profile [puppet] - 10https://gerrit.wikimedia.org/r/1293809 (https://phabricator.wikimedia.org/T324335) [22:08:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1008.eqiad.wmnet with OS trixie [22:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1293809 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [22:10:01] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1009.eqiad.wmnet with OS trixie [22:18:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P93098 and previous config saved to /var/cache/conftool/dbconfig/20260526-221819-fceratto.json [22:23:48] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp6015.drmrs.wmnet [22:28:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T426633)', diff saved to https://phabricator.wikimedia.org/P93099 and previous config saved to /var/cache/conftool/dbconfig/20260526-222828-fceratto.json [22:28:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [22:28:48] !log fceratto@cumin1003 dbctl commit (dc=all): 
'Depooling db2164 (T426633)', diff saved to https://phabricator.wikimedia.org/P93100 and previous config saved to /var/cache/conftool/dbconfig/20260526-222848-fceratto.json [22:31:41] RECOVERY - Host cp6015 is UP: PING OK - Packet loss = 0%, RTA = 87.30 ms [22:32:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T426633)', diff saved to https://phabricator.wikimedia.org/P93101 and previous config saved to /var/cache/conftool/dbconfig/20260526-223556-fceratto.json [22:45:58] FIRING: ProbeDown: Service upload:80 has failed probes (http_upload_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:46:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P93103 and previous config saved to /var/cache/conftool/dbconfig/20260526-224604-fceratto.json [22:50:58] RESOLVED: ProbeDown: Service upload:80 has failed probes (http_upload_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:53:58] FIRING: ProbeDown: Service upload:80 has failed probes (http_upload_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:13] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P93104 and previous config saved to /var/cache/conftool/dbconfig/20260526-225612-fceratto.json [22:58:58] RESOLVED: ProbeDown: Service upload:80 has failed probes (http_upload_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:59:32] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11957602 (10RobH) a:05RobH→03ssingh @ssingh, After flashing firmware of idrac/bios/network to latest the problem hasn't reoccurred in half a dozen boots. Want to take this back, reimage, and reintroduce into service?... [23:00:11] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11957605 (10RobH) [23:00:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) (owner: 10Santiago Faci) [23:06:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T426633)', diff saved to https://phabricator.wikimedia.org/P93105 and previous config saved to /var/cache/conftool/dbconfig/20260526-230620-fceratto.json [23:06:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [23:06:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2165 (T426633)', diff saved to https://phabricator.wikimedia.org/P93106 and previous config saved to /var/cache/conftool/dbconfig/20260526-230650-fceratto.json [23:07:47] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: 
name=cp5026.* [23:13:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T426633)', diff saved to https://phabricator.wikimedia.org/P93107 and previous config saved to /var/cache/conftool/dbconfig/20260526-231358-fceratto.json [23:16:40] (03PS1) 10Bartosz Dziewoński: Fix case of 'commonsfinder' in $wgUrlProtocols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293819 (https://phabricator.wikimedia.org/T426614) [23:17:28] (03PS2) 10Bartosz Dziewoński: Fix case of 'commonsfinder' in $wgUrlProtocols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293819 (https://phabricator.wikimedia.org/T426614) [23:18:11] (03PS2) 10Bartosz Dziewoński: Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) [23:20:13] (03PS3) 10Bartosz Dziewoński: Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) [23:23:50] (03CR) 10CI reject: [V:04-1] Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [23:24:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P93108 and previous config saved to /var/cache/conftool/dbconfig/20260526-232406-fceratto.json [23:25:06] (03CR) 10Bartosz Dziewoński: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [23:27:50] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5026.* [23:34:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P93109 and previous config saved to 
/var/cache/conftool/dbconfig/20260526-233414-fceratto.json [23:39:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1293821 [23:39:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1293821 (owner: 10TrainBranchBot) [23:44:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T426633)', diff saved to https://phabricator.wikimedia.org/P93110 and previous config saved to /var/cache/conftool/dbconfig/20260526-234421-fceratto.json [23:44:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [23:44:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T426633)', diff saved to https://phabricator.wikimedia.org/P93111 and previous config saved to /var/cache/conftool/dbconfig/20260526-234451-fceratto.json [23:52:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T426633)', diff saved to https://phabricator.wikimedia.org/P93112 and previous config saved to /var/cache/conftool/dbconfig/20260526-235201-fceratto.json [23:54:08] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1293821 (owner: 10TrainBranchBot)