[00:02:25] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[00:03:24] (Merged) jenkins-bot: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[00:04:02] !log reprepro include php-apcu_5.1.24-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312
[00:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:07] T427312: Build PHP 8.3 packages for bookworm - https://phabricator.wikimedia.org/T427312
[00:04:52] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1294438|Init conductwiki (T426984)]]
[00:04:56] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984
[00:06:50] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1294438|Init conductwiki (T426984)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:08:07] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [00:08:25] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:39] !log reprepro include php-igbinary_3.2.16-4+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [00:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:40] !log reprepro include php-msgpack_3.0.0-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [00:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:44] T427312: Build PHP 8.3 packages for bookworm - https://phabricator.wikimedia.org/T427312 [00:10:07] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:10:09] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:10:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:10:17] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, 
wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:11:07] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:11:09] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:11:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:11:17] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:12:17] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294438|Init conductwiki (T426984)]] (duration: 07m 25s) [00:12:22] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984 [00:15:43] (03PS1) 10Ladsgroup: Activate conductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294470 (https://phabricator.wikimedia.org/T426984) [00:17:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294470 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [00:18:10] (03Merged) 10jenkins-bot: Activate conductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294470 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [00:18:35] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1294470|Activate conductwiki (T426984)]] [00:18:40] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984 [00:20:33] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1294470|Activate conductwiki (T426984)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [00:21:41] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [00:25:47] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294470|Activate conductwiki (T426984)]] (duration: 07m 12s) [00:25:52] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984 [00:39:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:42:17] (03PS6) 10Ssingh: LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [00:49:14] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:49:52] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961688 (10Papaul) [00:53:15] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new subnet in eqsin - pt1979@cumin2002" [00:53:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new subnet in eqsin - pt1979@cumin2002" [00:53:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:54:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961689 (10Papaul) VRRP is up on cr2-eqsin ` cr2-eqsin> show interfaces terse | match "ae1.512|ae1.522" et-0/0/1.512 up... 
[00:55:22] (CR) Ssingh: [V:+1] "PCC SUCCESS (NOOP 24): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8602/consol" [puppet] - https://gerrit.wikimedia.org/r/1282764 (owner: Ayounsi)
[01:09:53] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1294482
[01:09:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1294482 (owner: TrainBranchBot)
[01:19:27] (PS3) Arlolra: Deploy PRV to 7 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1293805 (https://phabricator.wikimedia.org/T427331)
[01:22:22] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1294482 (owner: TrainBranchBot)
[01:22:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:27:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:31:39] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11961708 (Jclark-ctr) @bcornwall I see cookbook failed. Is it still good for us to proceed with onsite work?
[01:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:00:53] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:34] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 41s) [02:08:54] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:23] PROBLEM - Host wdqs1015 is DOWN: PING CRITICAL - Packet loss = 100% [02:35:54] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:25:54] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:30:11] (03PS1) 10Papaul: Add new Eqsin subnet [puppet] - 10https://gerrit.wikimedia.org/r/1294487 (https://phabricator.wikimedia.org/T427393) [03:30:54] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:31:29] (03CR) 10CI reject: [V:04-1] Add new Eqsin subnet [puppet] - 10https://gerrit.wikimedia.org/r/1294487 (https://phabricator.wikimedia.org/T427393) (owner: 10Papaul) [03:40:25] (03PS2) 10Papaul: Add new Eqsin subnet [puppet] - 
10https://gerrit.wikimedia.org/r/1294487 (https://phabricator.wikimedia.org/T427393) [03:54:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::5e5e:ab00:c3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:59:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::5e5e:ab00:c3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:08:40] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:31:17] FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:36:17] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:39:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:05] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11961798 (10Marostegui) >>! In T427388#11960531, @Jhancock.wm wrote: > it halted in the boot and i had to pull the power entirely to get it to reboot and make it past post. There still isn't anything ne... [05:09:55] (03PS2) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1292091 [05:21:19] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376#11961807 (10Marostegui) >>! In T427376#11960059, @Jhancock.wm wrote: > @FCeratto-WMF okay the error code we got was inconclusive. it could mean a lot of things including just out of date firmware. I've updated... [05:21:23] (03PS1) 10Marostegui: Revert "db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1294491 [05:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:56:23] (03PS1) 10Marostegui: db2212: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1294807 (https://phabricator.wikimedia.org/T427388) [05:57:18] (03CR) 10Marostegui: [C:03+2] db2212: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1294807 (https://phabricator.wikimedia.org/T427388) (owner: 10Marostegui) [05:58:07] (03PS1) 10Thiemo Kreuz (WMDE): Don't run the click intent experiment on mobile [extensions/Cite] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294808 (https://phabricator.wikimedia.org/T426743) [05:59:36] (03CR) 10Hashar: "I am quite happy that got caught ahead of time! 
Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0600) [06:00:04] marostegui, Amir1, and federico3: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0600). nyaa~ [06:03:25] RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s8 T426095 [06:04:07] T426095: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T426095 [06:04:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db1193 with weight 0 T426095', diff saved to https://phabricator.wikimedia.org/P93298 and previous config saved to /var/cache/conftool/dbconfig/20260528-060412-fceratto.json [06:07:55] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db1193 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1286427 (https://phabricator.wikimedia.org/T426095) (owner: 10Gerrit maintenance bot) [06:08:38] (03PS1) 10Arnaudb: gitlab: upgrade gitlab-runners and gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/1294809 (https://phabricator.wikimedia.org/T427436) [06:10:24] !log Starting s8 eqiad failover from db1209 to db1193 - T426095 [06:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:28] T426095: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T426095 [06:10:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - 
T426095', diff saved to https://phabricator.wikimedia.org/P93299 and previous config saved to /var/cache/conftool/dbconfig/20260528-061048-fceratto.json [06:11:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db1193 to s8 primary and set section read-write T426095', diff saved to https://phabricator.wikimedia.org/P93300 and previous config saved to /var/cache/conftool/dbconfig/20260528-061138-fceratto.json [06:14:13] (03CR) 10Federico Ceratto: [C:03+1] wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286428 (https://phabricator.wikimedia.org/T426095) (owner: 10Gerrit maintenance bot) [06:14:21] (03CR) 10Federico Ceratto: [C:03+2] wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286428 (https://phabricator.wikimedia.org/T426095) (owner: 10Gerrit maintenance bot) [06:14:47] !log fceratto@dns1005 START - running authdns-update [06:15:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Cite] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294808 (https://phabricator.wikimedia.org/T426743) (owner: 10Thiemo Kreuz (WMDE)) [06:15:41] (03CR) 10WMDE-Fisch: [C:03+1] Don't run the click intent experiment on mobile [extensions/Cite] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294808 (https://phabricator.wikimedia.org/T426743) (owner: 10Thiemo Kreuz (WMDE)) [06:16:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db1209 T426095', diff saved to https://phabricator.wikimedia.org/P93301 and previous config saved to /var/cache/conftool/dbconfig/20260528-061609-fceratto.json [06:16:14] T426095: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T426095 [06:16:28] !log fceratto@dns1005 END - running authdns-update [06:18:06] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator 
resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:19:09] (03CR) 10Ayounsi: Add new Eqsin subnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294487 (https://phabricator.wikimedia.org/T427393) (owner: 10Papaul) [06:23:55] jouncebot: now [06:23:55] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0600) [06:23:55] For the next 0 hour(s) and 6 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0600) [06:24:05] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [06:24:24] I am going to restart the CI Jenkins for some plugins upgrades [06:25:28] !log Restarting CI Jenkins for plugins upgrades [06:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:42] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [06:33:50] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:33:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T426633)', diff saved to https://phabricator.wikimedia.org/P93302 and previous config saved to /var/cache/conftool/dbconfig/20260528-063357-fceratto.json [06:38:16] (03PS2) 10Ryan Kemper: OpenSearch: Add required config for bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [06:38:42] (03CR) 10Federico Ceratto: sre.mysql.pool: Support depooling unreachable hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1294265 
(https://phabricator.wikimedia.org/T427381) (owner: 10Federico Ceratto) [06:38:55] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [06:40:58] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: re-rack mc2055 (before Jun 9th) - https://phabricator.wikimedia.org/T427373#11961888 (10jijiki) @Jhancock.wm Thank you! I have added a calendar invite as a reminder for Jun 2nd. [06:42:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T426633)', diff saved to https://phabricator.wikimedia.org/P93303 and previous config saved to /var/cache/conftool/dbconfig/20260528-064217-fceratto.json [06:43:48] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1167: Reboot completed [06:43:54] 06SRE, 10observability, 10Observability-Alerting: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469 (10Marostegui) 03NEW [06:45:59] (03CR) 10Jelto: [C:03+2] gitlab: upgrade gitlab-runners and gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/1294809 (https://phabricator.wikimedia.org/T427436) (owner: 10Arnaudb) [06:52:53] (03CR) 10Ryan Kemper: OpenSearch: Add required config for bootstrapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [07:00:05] Amir1, urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0700). Please do the needful. [07:00:05] codders, Robertsky, and WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:10] o/ [07:00:11] o? 
[07:00:13] o/
[07:00:27] \o
[07:01:07] (PS3) Ryan Kemper: OpenSearch: Add required config for bootstrapping [puppet] - https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: Bking)
[07:01:16] I could deploy
[07:01:25] thank you!
[07:01:28] thanks!
[07:02:06] (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) (owner: Arthur taylor)
[07:03:04] (Merged) jenkins-bot: Disable support for PHP-serialized EntityData on Wikidata production [mediawiki-config] - https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) (owner: Arthur taylor)
[07:04:03] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1289898|Disable support for PHP-serialized EntityData on Wikidata production (T98035)]]
[07:04:07] T98035: [Task] Drop support for php-serialized output from Special:EntityData - https://phabricator.wikimedia.org/T98035
[07:06:08] !log wmde-fisch@deploy1003 wmde-fisch, arthurtaylor: Backport for [[gerrit:1289898|Disable support for PHP-serialized EntityData on Wikidata production (T98035)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:06:14] codders: It's on the test servers
[07:06:19] testing now
[07:06:54] yep - looks good. please proceed
[07:07:04] Okay
[07:07:08] !log wmde-fisch@deploy1003 wmde-fisch, arthurtaylor: Continuing with deployment
[07:08:50] (CR) Arnaudb: [C:+2] gerrit: drop /srv/gerrit/plugins [puppet] - https://gerrit.wikimedia.org/r/1193832 (owner: Hashar)
[07:11:18] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289898|Disable support for PHP-serialized EntityData on Wikidata production (T98035)]] (duration: 07m 15s)
[07:11:24] T98035: [Task] Drop support for php-serialized output from Special:EntityData - https://phabricator.wikimedia.org/T98035
[07:11:29] thanks!
[07:11:51] codders: You're welcome
[07:12:01] standing by
[07:12:08] robertsky: Moving on with your patch
[07:12:24] (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: Robertsky)
[07:12:46] SRE, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Radar), User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11961950 (MoritzMuehlenhoff) I opened a ticket at https://anonticket.torproject.org/ and mirrors.wikimedia.org has bee...
[07:13:21] (Merged) jenkins-bot: Update wikimania wordmark for 2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: Robertsky)
[07:13:43] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1270986|Update wikimania wordmark for 2026 (T413331)]]
[07:13:48] T413331: Add logo to Wikimania on Vector 2022 - https://phabricator.wikimedia.org/T413331
[07:15:09] (PS2) Muehlenhoff: mirrors: Disable tails mirror [puppet] - https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707)
[07:15:40] !log wmde-fisch@deploy1003 wmde-fisch, robertsky: Backport for [[gerrit:1270986|Update wikimania wordmark for 2026 (T413331)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:16:04] robertsky: Want to have a look on it on the test servers or should I just go on?
[07:16:16] i have tested. go ahead. :)
[07:16:25] !log wmde-fisch@deploy1003 wmde-fisch, robertsky: Continuing with deployment
[07:16:32] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1167: Reboot completed
[07:18:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1264.eqiad.wmnet with reason: Maintenance
[07:18:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1264 (T426633)', diff saved to https://phabricator.wikimedia.org/P93307 and previous config saved to /var/cache/conftool/dbconfig/20260528-071836-fceratto.json
[07:20:37] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270986|Update wikimania wordmark for 2026 (T413331)]] (duration: 06m 54s)
[07:20:41] T413331: Add logo to Wikimania on Vector 2022 - https://phabricator.wikimedia.org/T413331
[07:20:44] robertsky: All done
[07:21:12] (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [extensions/Cite] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294808 (https://phabricator.wikimedia.org/T426743) (owner: Thiemo Kreuz (WMDE))
[07:21:46] thanks!
[07:22:15] (PS1) Brouberol: growthbook: allow WMDE engineers to self-enroll [deployment-charts] - https://gerrit.wikimedia.org/r/1294817 (https://phabricator.wikimedia.org/T418665)
[07:22:26] (Merged) jenkins-bot: Don't run the click intent experiment on mobile [extensions/Cite] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294808 (https://phabricator.wikimedia.org/T426743) (owner: Thiemo Kreuz (WMDE))
[07:22:42] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[07:22:53] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1294808|Don't run the click intent experiment on mobile (T426743)]]
[07:22:59] T426743: Setup A/B test experiment on TestKitchen UI - https://phabricator.wikimedia.org/T426743
[07:23:27] (PS1) Slyngshede: Update to CAS version 7.3.7.1 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1294818
[07:23:27] !log tgr@deploy1003 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwikisource --logwiki=metawiki Ioed Renamed_user_4232d41570b9e8f46ef150e5e360e446 # T427459
[07:23:32] T427459: Unblock stuck global rename of Renamed user 4232d41570b9e8f46ef150e5e360e446 - https://phabricator.wikimedia.org/T427459
[07:24:39] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[07:24:49] !log wmde-fisch@deploy1003 thiemowmde, wmde-fisch: Backport for [[gerrit:1294808|Don't run the click intent experiment on mobile (T426743)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:24:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264 (T426633)', diff saved to https://phabricator.wikimedia.org/P93308 and previous config saved to /var/cache/conftool/dbconfig/20260528-072458-fceratto.json [07:25:14] !log wmde-fisch@deploy1003 thiemowmde, wmde-fisch: Continuing with deployment [07:26:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1294818 (owner: 10Slyngshede) [07:27:07] (03CR) 10Slyngshede: [V:03+2 C:03+2] Update to CAS version 7.3.7.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1294818 (owner: 10Slyngshede) [07:27:47] (03PS3) 10Muehlenhoff: mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) [07:27:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [07:28:19] (03CR) 10CI reject: [V:04-1] mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [07:29:22] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294808|Don't run the click intent experiment on mobile (T426743)]] (duration: 06m 29s) [07:29:27] T426743: Setup A/B test experiment on TestKitchen UI - https://phabricator.wikimedia.org/T426743 [07:30:09] I'm done with the backport window \o [07:34:38] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [07:35:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264', diff saved to https://phabricator.wikimedia.org/P93309 and previous config saved to /var/cache/conftool/dbconfig/20260528-073506-fceratto.json [07:36:56] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab 
host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [07:37:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2189: repool after crash [07:37:52] (03CR) 10Marostegui: [C:03+2] Revert "db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1294491 (owner: 10Marostegui) [07:38:19] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376#11962032 (10Marostegui) 05Open→03Resolved [07:39:33] (03PS1) 10Slyngshede: Move Debian package to version 7.3.7.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1294824 [07:43:55] (03PS9) 10Daniel Kinzler: Rakefile: Run chart specific tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) [07:44:40] (03PS1) 10WMDE-Fisch: Update VE core submodule to master (9cf5524e7) [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) [07:44:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1294824 (owner: 10Slyngshede) [07:45:10] (03CR) 10Slyngshede: [V:03+2 C:03+2] Move Debian package to version 7.3.7.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1294824 (owner: 10Slyngshede) [07:45:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264', diff saved to https://phabricator.wikimedia.org/P93311 and previous config saved to /var/cache/conftool/dbconfig/20260528-074513-fceratto.json [07:45:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [07:47:04] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade 
(exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [07:50:34] (03PS1) 10Komla Sapaty: profile::toolforge::bastion: add SSH login activity export timer [puppet] - 10https://gerrit.wikimedia.org/r/1294864 [07:51:06] (03CR) 10CI reject: [V:04-1] profile::toolforge::bastion: add SSH login activity export timer [puppet] - 10https://gerrit.wikimedia.org/r/1294864 (owner: 10Komla Sapaty) [07:55:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264 (T426633)', diff saved to https://phabricator.wikimedia.org/P93313 and previous config saved to /var/cache/conftool/dbconfig/20260528-075521-fceratto.json [07:55:42] (03PS5) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) [07:55:49] (03PS4) 10Muehlenhoff: mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) [07:56:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [07:56:22] (03CR) 10CI reject: [V:04-1] mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [07:56:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020,1022-1023].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:56:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1211 (T426633)', diff saved to https://phabricator.wikimedia.org/P93314 and previous config saved to /var/cache/conftool/dbconfig/20260528-075631-fceratto.json [07:56:43] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Update VE core submodule to master (9cf5524e7) [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0800) [08:00:44] morning, train will happen in a few [08:00:45] (03PS2) 10Trueg: dse-k8s-eqiad: Add wdqs namespace for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) [08:03:08] (03PS1) 10TrainBranchBot: group2 to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294919 (https://phabricator.wikimedia.org/T423913) [08:03:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T426633)', diff saved to https://phabricator.wikimedia.org/P93315 and previous config saved to /var/cache/conftool/dbconfig/20260528-080309-fceratto.json [08:03:11] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294919 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [08:03:35] (03PS1) 10Brouberol: cephosd: export user/bucket-scoped stats to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1294917 (https://phabricator.wikimedia.org/T427404) [08:04:47] (03Merged) 10jenkins-bot: group2 to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294919 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [08:05:23] !log hashar@deploy1003 Started deploy [integration/docroot@2a51016]: build: update dependencies + eslint fix in comment. f021d3f..2a51016 [08:05:37] !log hashar@deploy1003 Finished deploy [integration/docroot@2a51016]: build: update dependencies + eslint fix in comment. 
f021d3f..2a51016 (duration: 00m 13s) [08:06:40] (03PS5) 10Muehlenhoff: mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) [08:06:53] (03PS1) 10Slyngshede: IDP: Failover for update [dns] - 10https://gerrit.wikimedia.org/r/1294922 [08:09:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1209: Test [08:10:48] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.4 refs T423913 [08:10:52] T423913: 1.47.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T423913 [08:13:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P93318 and previous config saved to /var/cache/conftool/dbconfig/20260528-081316-fceratto.json [08:14:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [08:15:34] (03CR) 10Slyngshede: [C:03+2] IDP: Failover for update [dns] - 10https://gerrit.wikimedia.org/r/1294922 (owner: 10Slyngshede) [08:16:02] !log slyngshede@dns1004 START - running authdns-update [08:17:47] !log slyngshede@dns1004 END - running authdns-update [08:17:51] (03PS6) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) [08:20:08] (03CR) 10Muehlenhoff: [C:03+2] Switch rpki2003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294241 (owner: 10Muehlenhoff) [08:20:36] (03PS1) 10Dreamy Jazz: hCaptcha: Regenerate VisualEditor captcha token per save attempt [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294925 (https://phabricator.wikimedia.org/T427334) [08:21:20] jouncebot: nowandnext [08:21:20] For the next 1 hour(s) and 38 minute(s): MediaWiki train - Utc-0 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0800) [08:21:20] In 1 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1000) [08:21:41] jnuche: Any problem with me deploying? [08:23:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2189: repool after crash [08:23:17] Dreamy_Jazz: nope, you can go ahead [08:23:22] Thanks [08:23:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P93320 and previous config saved to /var/cache/conftool/dbconfig/20260528-082324-fceratto.json [08:24:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1209: Test [08:25:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294925 (https://phabricator.wikimedia.org/T427334) (owner: 10Dreamy Jazz) [08:31:55] (03PS1) 10Slyngshede: IDP: Upgrade to CAS 7.3.7.1 [dns] - 10https://gerrit.wikimedia.org/r/1294928 [08:33:11] (03PS1) 10Jcrespo: mediabackup: Set backup[12]00[4-7] as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294929 (https://phabricator.wikimedia.org/T420506) [08:33:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T426633)', diff saved to https://phabricator.wikimedia.org/P93322 and previous config saved to /var/cache/conftool/dbconfig/20260528-083331-fceratto.json [08:33:39] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294929 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [08:33:46] (03CR) 10CI reject: [V:04-1] mediabackup: Set backup[12]00[4-7] as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294929 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [08:34:00] (03CR) 10Atsuko: 
[C:04-1] "typo" [puppet] - 10https://gerrit.wikimedia.org/r/1294917 (https://phabricator.wikimedia.org/T427404) (owner: 10Brouberol) [08:34:31] (03PS2) 10Jcrespo: mediabackup: Set backup[12]00[4-7] as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294929 (https://phabricator.wikimedia.org/T420506) [08:34:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:34:50] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294929 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [08:34:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1025].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:35:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T426633)', diff saved to https://phabricator.wikimedia.org/P93323 and previous config saved to /var/cache/conftool/dbconfig/20260528-083504-fceratto.json [08:35:46] (03PS2) 10Brouberol: cephosd: export user/bucket-scoped stats to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1294917 (https://phabricator.wikimedia.org/T427404) [08:36:17] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:36:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and quick functionality test on idp-test and 1005 went fine" [dns] - 10https://gerrit.wikimedia.org/r/1294928 (owner: 10Slyngshede) [08:36:56] (03CR) 10Brouberol: cephosd: export user/bucket-scoped stats to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294917 (https://phabricator.wikimedia.org/T427404) (owner: 
10Brouberol) [08:37:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:37:15] (03CR) 10Marostegui: [C:03+1] mediabackup: Set backup[12]00[4-7] as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294929 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [08:37:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet [08:38:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:38:47] (03CR) 10Jcrespo: [C:03+2] mediabackup: Set backup[12]00[4-7] as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294929 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [08:39:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet [08:39:34] (03PS1) 10Muehlenhoff: Switch rpkivalidator role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294930 [08:39:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:51] (03Merged) 10jenkins-bot: hCaptcha: Regenerate VisualEditor captcha token per save attempt [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294925 (https://phabricator.wikimedia.org/T427334) (owner: 10Dreamy Jazz) [08:40:01] (03PS1) 10Ayounsi: Nokia: also alow anycast prefixes from Ganeti peers [homer/public] - 10https://gerrit.wikimedia.org/r/1294931 (https://phabricator.wikimedia.org/T423384) [08:40:06] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1294925|hCaptcha: 
Regenerate VisualEditor captcha token per save attempt (T427334)]] [08:40:13] T427334: hCaptcha VisualEditor: hCaptcha token is reused if AbuseFilter blocks the edit - https://phabricator.wikimedia.org/T427334 [08:40:26] (03CR) 10Atsuko: [C:03+1] cephosd: export user/bucket-scoped stats to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1294917 (https://phabricator.wikimedia.org/T427404) (owner: 10Brouberol) [08:40:50] (03CR) 10Atsuko: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1294917 (https://phabricator.wikimedia.org/T427404) (owner: 10Brouberol) [08:41:49] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1294925|hCaptcha: Regenerate VisualEditor captcha token per save attempt (T427334)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:41:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T426633)', diff saved to https://phabricator.wikimedia.org/P93324 and previous config saved to /var/cache/conftool/dbconfig/20260528-084149-fceratto.json [08:42:03] (03CR) 10Muehlenhoff: [C:03+2] Add urldownloader[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/1293743 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [08:43:16] Testing.... 
[08:43:45] (03PS2) 10Muehlenhoff: Update role contacts in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1291711 [08:43:55] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:44:30] (03PS2) 10Ayounsi: Nokia: also allow anycast prefixes from Ganeti peers [homer/public] - 10https://gerrit.wikimedia.org/r/1294931 (https://phabricator.wikimedia.org/T423384) [08:45:13] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [08:46:01] (03CR) 10Slyngshede: [C:03+2] IDP: Upgrade to CAS 7.3.7.1 [dns] - 10https://gerrit.wikimedia.org/r/1294928 (owner: 10Slyngshede) [08:46:27] !log slyngshede@dns1004 START - running authdns-update [08:47:05] !log Upgrade IDP to CAS 7.3.7.1 [08:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:17] !log joal@deploy1003 Started deploy [analytics/refinery@878cb24] (hadoop-test): Regular analytics weekly train TEST -2 [analytics/refinery@878cb24a] [08:47:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1165: Reboot completed [08:48:12] !log slyngshede@dns1004 END - running authdns-update [08:48:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [08:49:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1209 (T419635)', diff saved to https://phabricator.wikimedia.org/P93326 and previous config saved to /var/cache/conftool/dbconfig/20260528-084906-fceratto.json [08:49:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:49:17] !log joal@deploy1003 Finished deploy [analytics/refinery@878cb24] (hadoop-test): Regular analytics weekly train TEST -2 [analytics/refinery@878cb24a] (duration: 
02m 00s) [08:49:26] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294925|hCaptcha: Regenerate VisualEditor captcha token per save attempt (T427334)]] (duration: 09m 20s) [08:49:31] T427334: hCaptcha VisualEditor: hCaptcha token is reused if AbuseFilter blocks the edit - https://phabricator.wikimedia.org/T427334 [08:49:36] I'm done with scap [08:49:36] !log joal@deploy1003 Started deploy [analytics/refinery@878cb24]: Regular analytics weekly train - 2 [analytics/refinery@878cb24a] [08:49:47] (03CR) 10Brouberol: [C:03+2] cephosd: export user/bucket-scoped stats to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1294917 (https://phabricator.wikimedia.org/T427404) (owner: 10Brouberol) [08:50:09] !log cr1-codfw# delete protocols bgp group fundraising family inet6 - T423384 [08:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:14] T423384: Investigate internal rejected prefixes - https://phabricator.wikimedia.org/T423384 [08:50:54] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:51:42] (03PS1) 10Muehlenhoff: aptrepo: Add Routinator for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1294933 [08:51:53] (03PS2) 10Muehlenhoff: aptrepo: Add Routinator for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1294933 [08:52:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T419635)', diff saved to https://phabricator.wikimedia.org/P93327 and previous config saved to /var/cache/conftool/dbconfig/20260528-085216-fceratto.json [08:55:28] (03CR) 10Gmodena: dse-k8s-eqiad: Add wdqs namespace for the new deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) 
(owner: 10Trueg) [08:56:30] !log joal@deploy1003 Finished deploy [analytics/refinery@878cb24]: Regular analytics weekly train - 2 [analytics/refinery@878cb24a] (duration: 06m 54s) [08:56:46] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@6200ab1] (releasing): T427406 Testing on backup host [08:57:09] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@6200ab1] (releasing): T427406 Testing on backup host (duration: 00m 53s) [08:58:07] (03CR) 10Trueg: dse-k8s-eqiad: Add wdqs namespace for the new deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [08:58:43] !log joal@deploy1003 Started deploy [analytics/refinery@878cb24] (thin): Regular analytics weekly train THIN - 2[analytics/refinery@878cb24a] [08:59:37] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@6200ab1] (releasing): T427406 Deploying to prod [08:59:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1148:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1148 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:00:52] !log joal@deploy1003 Finished deploy [analytics/refinery@878cb24] (thin): Regular analytics weekly train THIN - 2[analytics/refinery@878cb24a] (duration: 02m 08s) [09:01:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1294931 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [09:01:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [09:01:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T426633)', diff saved to https://phabricator.wikimedia.org/P93328 and previous config saved to /var/cache/conftool/dbconfig/20260528-090114-fceratto.json [09:02:01] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@6200ab1] (releasing): T427406 Deploying to prod (duration: 02m 31s) [09:02:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P93329 and previous config saved to /var/cache/conftool/dbconfig/20260528-090224-fceratto.json [09:02:26] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! Nice work with the cumin host selection, I'm taking notes :)" [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [09:04:07] (03CR) 10Muehlenhoff: [C:03+2] autoinstall: Stop using mirrors.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1294285 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:04:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1148:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1148 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:08:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T426633)', diff saved to https://phabricator.wikimedia.org/P93331 and previous config saved to /var/cache/conftool/dbconfig/20260528-090826-fceratto.json [09:08:57] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:09:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:09:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:09:51] (03PS6) 10Muehlenhoff: mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) [09:10:22] (03CR) 10Clément Goubert: [C:03+2] Add Wikimedia REST API ?spec route for *.wikinews.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269099 (https://phabricator.wikimedia.org/T418318) (owner: 10Aaron Schulz) [09:12:11] (03CR) 10MGChecker: "This change can be abandoned." [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286465 (https://phabricator.wikimedia.org/T425988) (owner: 10Jforrester) [09:12:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:12:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P93332 and previous config saved to /var/cache/conftool/dbconfig/20260528-091231-fceratto.json [09:12:56] (03Merged) 10jenkins-bot: Add Wikimedia REST API ?spec route for *.wikinews.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269099 (https://phabricator.wikimedia.org/T418318) (owner: 10Aaron Schulz) [09:13:05] !log elukey@cumin1003 START - Cookbook sre.hosts.decommission for hosts pki-root1001.eqiad.wmnet [09:13:15] (03Abandoned) 10Hashar: Fix MediaHandler caching to not preserve language [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286465 (https://phabricator.wikimedia.org/T425988) (owner: 10Jforrester) [09:13:38] (03CR) 10Filippo 
Giunchedi: [C:03+1] "LGTM, nice" [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [09:13:59] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:14:06] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:17:49] !log elukey@cumin1003 START - Cookbook sre.dns.netbox [09:17:49] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:17:54] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1165: Reboot completed [09:18:08] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:18:22] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:18:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P93334 and previous config saved to /var/cache/conftool/dbconfig/20260528-091834-fceratto.json [09:18:47] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:18:57] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:19:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:20:17] (03CR) 10Muehlenhoff: [C:03+2] mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:21:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294229 (https://phabricator.wikimedia.org/T427369) (owner: 10STran) [09:22:07] !log elukey@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki-root1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1003" [09:22:31] (03PS1) 10Brouberol: cephosd: fix radosgw exporter file source path [puppet] - 10https://gerrit.wikimedia.org/r/1294936 (https://phabricator.wikimedia.org/T427404) [09:22:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki-root1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1003" [09:22:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:22:38] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pki-root1001.eqiad.wmnet [09:22:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T419635)', diff saved to https://phabricator.wikimedia.org/P93335 and previous config saved to /var/cache/conftool/dbconfig/20260528-092239-fceratto.json [09:22:51] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:22:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [09:23:09] jnuche: Any problem with me deploying? 
[09:23:12] Oops [09:23:16] Wrong message [09:23:18] Sorry for the ping [09:23:22] jouncebot: nowandnext [09:23:22] For the next 0 hour(s) and 36 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0800) [09:23:23] In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1000) [09:23:29] :D [09:23:42] (03CR) 10Brouberol: [C:03+2] cephosd: fix radosgw exporter file source path [puppet] - 10https://gerrit.wikimedia.org/r/1294936 (https://phabricator.wikimedia.org/T427404) (owner: 10Brouberol) [09:24:01] (03CR) 10STran: "Per Eric, doing nothing as it should run in 99.9% mode in those cases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [09:24:02] (03PS1) 10Dreamy Jazz: CheckUserLookupUtils: Fix error introduced by strict types [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294937 (https://phabricator.wikimedia.org/T427480) [09:25:05] (03CR) 10Dreamy Jazz: [C:03+1] "Thanks for checking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [09:26:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [09:26:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294937 (https://phabricator.wikimedia.org/T427480) (owner: 10Dreamy Jazz) [09:26:42] (03PS1) 10Muehlenhoff: Remove remaining bits of Tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294938 (https://phabricator.wikimedia.org/T416707) [09:27:55] (03Merged) 10jenkins-bot: Set minimum edit count for skipcaptcha 
right to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [09:28:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P93336 and previous config saved to /var/cache/conftool/dbconfig/20260528-092842-fceratto.json [09:36:16] (03CR) 10Ayounsi: [C:03+1] aptrepo: Add Routinator for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1294933 (owner: 10Muehlenhoff) [09:36:43] (03CR) 10Ayounsi: [C:03+1] Switch rpkivalidator role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294930 (owner: 10Muehlenhoff) [09:38:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T426633)', diff saved to https://phabricator.wikimedia.org/P93337 and previous config saved to /var/cache/conftool/dbconfig/20260528-093849-fceratto.json [09:39:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:39:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T426633)', diff saved to https://phabricator.wikimedia.org/P93338 and previous config saved to /var/cache/conftool/dbconfig/20260528-093920-fceratto.json [09:39:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294938 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:40:32] (03Merged) 10jenkins-bot: CheckUserLookupUtils: Fix error introduced by strict types [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294937 (https://phabricator.wikimedia.org/T427480) (owner: 10Dreamy Jazz) [09:40:47] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1294243|Set minimum edit count for skipcaptcha right to 10 (T426973)]], [[gerrit:1294937|CheckUserLookupUtils: Fix error introduced by strict types (T427480)]] [09:40:54] T426973: 
Support a configurable minimum edit count requirement before `skipcaptcha` right will take effect - https://phabricator.wikimedia.org/T426973 [09:40:55] T427480: CheckUserLookupUtils: ArchivedRevisionLookup::getArchivedRevisionRecord(): Argument #2 ($revId) must be of type int, string given - https://phabricator.wikimedia.org/T427480 [09:42:32] !log dreamyjazz@deploy1003 dreamyjazz, stran: Backport for [[gerrit:1294243|Set minimum edit count for skipcaptcha right to 10 (T426973)]], [[gerrit:1294937|CheckUserLookupUtils: Fix error introduced by strict types (T427480)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:43:40] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:43:52] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:43:55] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:44:05] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [09:44:18] !log dreamyjazz@deploy1003 dreamyjazz, stran: Continuing with deployment [09:47:50] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [09:48:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T426633)', diff saved to https://phabricator.wikimedia.org/P93339 and previous config saved to /var/cache/conftool/dbconfig/20260528-094807-fceratto.json [09:48:24] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for 
[[gerrit:1294243|Set minimum edit count for skipcaptcha right to 10 (T426973)]], [[gerrit:1294937|CheckUserLookupUtils: Fix error introduced by strict types (T427480)]] (duration: 07m 37s) [09:48:31] T426973: Support a configurable minimum edit count requirement before `skipcaptcha` right will take effect - https://phabricator.wikimedia.org/T426973 [09:48:32] T427480: CheckUserLookupUtils: ArchivedRevisionLookup::getArchivedRevisionRecord(): Argument #2 ($revId) must be of type int, string given - https://phabricator.wikimedia.org/T427480 [09:50:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: 10JavierMonton) [09:52:27] (03Merged) 10jenkins-bot: stream: webrequest.page_view [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: 10JavierMonton) [09:52:41] !log javiermonton@deploy1003 Started scap sync-world: Backport for [[gerrit:1290687|stream: webrequest.page_view (T426092 T426091)]] [09:52:46] T426092: Schema and Stream for "webrequest_frontend_text" - https://phabricator.wikimedia.org/T426092 [09:52:47] T426091: Schema and Stream for "webrequest.page_view" - https://phabricator.wikimedia.org/T426091 [09:54:25] !log javiermonton@deploy1003 javiermonton: Backport for [[gerrit:1290687|stream: webrequest.page_view (T426092 T426091)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:55:15] !log javiermonton@deploy1003 javiermonton: Continuing with deployment [09:57:25] (03PS1) 10Jcrespo: bacula: Setup backup1014 & backup2014 also as ES storage nodes [puppet] - 10https://gerrit.wikimedia.org/r/1294941 (https://phabricator.wikimedia.org/T420506) [09:58:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P93340 and previous config saved to /var/cache/conftool/dbconfig/20260528-095814-fceratto.json [09:59:22] !log javiermonton@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290687|stream: webrequest.page_view (T426092 T426091)]] (duration: 06m 41s) [09:59:27] T426092: Schema and Stream for "webrequest_frontend_text" - https://phabricator.wikimedia.org/T426092 [09:59:28] T426091: Schema and Stream for "webrequest.page_view" - https://phabricator.wikimedia.org/T426091 [09:59:29] (03CR) 10CI reject: [V:04-1] bacula: Setup backup1014 & backup2014 also as ES storage nodes [puppet] - 10https://gerrit.wikimedia.org/r/1294941 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1000) [10:04:26] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T427487#11962719 (10elukey) [10:05:05] (03PS2) 10Jcrespo: bacula: Setup backup1014 & backup2014 also as ES storage nodes [puppet] - 10https://gerrit.wikimedia.org/r/1294941 (https://phabricator.wikimedia.org/T420506) [10:05:14] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294941 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [10:05:35] (03PS1) 10Elukey: Remove pki-root1001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1294942 (https://phabricator.wikimedia.org/T427487) [10:08:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P93341 and previous config saved to /var/cache/conftool/dbconfig/20260528-100822-fceratto.json [10:16:23] (03CR) 10Bearloga: [C:03+1] growthbook: allow WMDE engineers to self-enroll [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294817 (https://phabricator.wikimedia.org/T418665) (owner: 10Brouberol) [10:18:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T426633)', diff saved to https://phabricator.wikimedia.org/P93342 and previous config saved to /var/cache/conftool/dbconfig/20260528-101829-fceratto.json [10:18:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [10:19:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T426633)', diff saved to https://phabricator.wikimedia.org/P93343 and previous config saved to /var/cache/conftool/dbconfig/20260528-101900-fceratto.json [10:23:12] (03PS1) 10Majavah: P:openstack: cloudweb_mcrouter: Migrate to firewall defines [puppet] - 10https://gerrit.wikimedia.org/r/1294946 [10:23:42] (03CR) 10CI reject: [V:04-1] P:openstack: cloudweb_mcrouter: Migrate to firewall defines [puppet] - 10https://gerrit.wikimedia.org/r/1294946 (owner: 10Majavah) [10:23:59] (03PS1) 10JavierMonton: stream: webrequest_page_view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294947 (https://phabricator.wikimedia.org/T425624) [10:24:15] (03PS2) 10Majavah: P:openstack: cloudweb_mcrouter: Migrate to firewall defines [puppet] - 10https://gerrit.wikimedia.org/r/1294946 [10:26:57] !log arthurtaylor@deploy1003 mwscript-k8s job started: extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki wikidatawiki --new-data-type external-id --property-id P1748 # T422392 [10:27:02] T422392: Convert property "NCI Thesaurus ID (P1748)" to external identifier - https://phabricator.wikimedia.org/T422392 [10:27:31] 
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T426633)', diff saved to https://phabricator.wikimedia.org/P93344 and previous config saved to /var/cache/conftool/dbconfig/20260528-102730-fceratto.json [10:27:40] (03PS3) 10Majavah: P:openstack: cloudweb_mcrouter: Migrate to firewall defines [puppet] - 10https://gerrit.wikimedia.org/r/1294946 [10:27:40] (03PS1) 10Majavah: firewall::client: Fix default for qos [puppet] - 10https://gerrit.wikimedia.org/r/1294948 [10:28:39] !log arthurtaylor@deploy1003 mwscript-k8s job started: extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki wikidatawiki --new-data-type external-id --property-id P14223 # T422264 [10:28:43] T422264: Convert property "Digital Public Good ID (P14223)" from String to External identifier - https://phabricator.wikimedia.org/T422264 [10:29:17] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8606/co" [puppet] - 10https://gerrit.wikimedia.org/r/1294946 (owner: 10Majavah) [10:29:45] !log arthurtaylor@deploy1003 mwscript-k8s job started: extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki wikidatawiki --new-data-type external-id --property-id P13724 # T406971 [10:29:50] T406971: Change datatype of several string properties to external-id - https://phabricator.wikimedia.org/T406971 [10:33:42] (03PS1) 10Atsuko: translate: adding separate read/write endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) [10:34:45] (03CR) 10Jcrespo: [C:03+2] bacula: Setup backup1014 & backup2014 also as ES storage nodes [puppet] - 10https://gerrit.wikimedia.org/r/1294941 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [10:37:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P93345 and 
previous config saved to /var/cache/conftool/dbconfig/20260528-103738-fceratto.json [10:38:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T427487#11962831 (10VRiley-WMF) a:03VRiley-WMF [10:39:25] (03PS1) 10Majavah: P:sre::nftables_compat_check: Install python3-pypuppetdb [puppet] - 10https://gerrit.wikimedia.org/r/1294951 [10:41:53] (03PS1) 10Elukey: kerberos: exclude krb1002 to allow reimage and move-vlan [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) [10:42:06] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [10:42:23] (03CR) 10CI reject: [V:04-1] kerberos: exclude krb1002 to allow reimage and move-vlan [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [10:43:16] (03CR) 10Blake: [C:03+2] mcrouter_wancache: swap mc1055 for mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1294216 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [10:47:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P93346 and previous config saved to /var/cache/conftool/dbconfig/20260528-104745-fceratto.json [10:48:23] (03PS2) 10Elukey: kerberos: exclude krb1002 to allow reimage and move-vlan [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) [10:49:08] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [10:50:22] !log update trixie netboot image for 13.5 point release T427072 [10:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:27] T427072: Integrate Trixie 13.5 point update - 
https://phabricator.wikimedia.org/T427072 [10:50:48] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11962861 (10MoritzMuehlenhoff) [10:51:26] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11962864 (10MoritzMuehlenhoff) [10:55:41] !log blake@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [10:55:49] !log blake@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [10:55:51] (03PS3) 10Elukey: kerberos: exclude krb1002 to allow reimage and move-vlan [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) [10:55:57] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [10:56:03] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [10:56:03] !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [10:57:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T426633)', diff saved to https://phabricator.wikimedia.org/P93347 and previous config saved to /var/cache/conftool/dbconfig/20260528-105753-fceratto.json [10:58:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1195.eqiad.wmnet with reason: Maintenance [10:58:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [10:58:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T426633)', diff saved to https://phabricator.wikimedia.org/P93348 and previous config saved to /var/cache/conftool/dbconfig/20260528-105820-fceratto.json [10:59:00] (03CR) 10Muehlenhoff: [C:03+2] Remove remaining bits of Tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294938 
(https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [11:01:11] (03CR) 10Elukey: "@mmuhlenhoff@wikimedia.org what do you think? The idea would be to puppet-configure the kerberos stack to be single host for a bit, to all" [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [11:05:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T426633)', diff saved to https://phabricator.wikimedia.org/P93349 and previous config saved to /var/cache/conftool/dbconfig/20260528-110536-fceratto.json [11:15:40] (03CR) 10Gmodena: dse-k8s-eqiad: Add wdqs namespace for the new deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [11:15:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P93350 and previous config saved to /var/cache/conftool/dbconfig/20260528-111543-fceratto.json [11:23:02] (03CR) 10Muehlenhoff: [C:03+2] Remove pki-root1001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1294942 (https://phabricator.wikimedia.org/T427487) (owner: 10Elukey) [11:23:10] (03CR) 10Muehlenhoff: [C:03+1] Remove pki-root1001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1294942 (https://phabricator.wikimedia.org/T427487) (owner: 10Elukey) [11:24:20] (03CR) 10Muehlenhoff: [C:03+2] pki:multirootca: Switch to nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1289355 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [11:25:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P93351 and previous config saved to /var/cache/conftool/dbconfig/20260528-112551-fceratto.json [11:29:12] (03Abandoned) 10Muehlenhoff: pki:multirootca: Enable nftables on the role level [puppet] - 
10https://gerrit.wikimedia.org/r/1286861 (owner: 10Muehlenhoff) [11:31:32] (03PS1) 10Muehlenhoff: Switch the pki:root role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294958 (https://phabricator.wikimedia.org/T416664) [11:35:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T426633)', diff saved to https://phabricator.wikimedia.org/P93352 and previous config saved to /var/cache/conftool/dbconfig/20260528-113559-fceratto.json [11:36:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [11:36:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:36:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T426633)', diff saved to https://phabricator.wikimedia.org/P93353 and previous config saved to /var/cache/conftool/dbconfig/20260528-113635-fceratto.json [11:45:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T426633)', diff saved to https://phabricator.wikimedia.org/P93354 and previous config saved to /var/cache/conftool/dbconfig/20260528-114500-fceratto.json [11:45:20] (03PS1) 10Muehlenhoff: Revert "autoinstall: Stop using mirrors.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1294960 [11:49:42] (03CR) 10Muehlenhoff: [C:03+2] Revert "autoinstall: Stop using mirrors.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1294960 (owner: 10Muehlenhoff) [11:49:48] (03PS2) 10Muehlenhoff: Revert "autoinstall: Stop using mirrors.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1294960 [11:51:09] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11962981 (10MoritzMuehlenhoff) [11:52:26] (03CR) 10Muehlenhoff: [C:03+2] Revert 
"autoinstall: Stop using mirrors.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1294960 (owner: 10Muehlenhoff) [11:55:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P93355 and previous config saved to /var/cache/conftool/dbconfig/20260528-115508-fceratto.json [11:59:02] (03CR) 10Brouberol: [C:03+2] growthbook: allow WMDE engineers to self-enroll [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294817 (https://phabricator.wikimedia.org/T418665) (owner: 10Brouberol) [11:59:55] (03PS1) 10Muehlenhoff: Record LDAP access for lwilson-ctr [puppet] - 10https://gerrit.wikimedia.org/r/1294963 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1200) [12:00:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [12:01:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [12:01:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [12:02:07] (03PS9) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [12:02:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [12:02:33] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1006.eqiad.wmnet with OS trixie [12:03:04] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for lwilson-ctr [puppet] - 10https://gerrit.wikimedia.org/r/1294963 (owner: 10Muehlenhoff) [12:05:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P93356 and previous config saved to 
/var/cache/conftool/dbconfig/20260528-120515-fceratto.json [12:07:25] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:47] jmm@cumin2002 reimage (PID 2034318) is awaiting input [12:09:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:09:30] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:10:07] (03CR) 10Ayounsi: [C:03+2] Nokia: also allow anycast prefixes from Ganeti peers [homer/public] - 10https://gerrit.wikimedia.org/r/1294931 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [12:10:47] (03CR) 10Muehlenhoff: "That sounds good, the only downside is that we are left with no KDCs at all should krb2002 have issues during the maintenance. 
The Hiera k" [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [12:10:48] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:11:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:11:30] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:14:17] (03Merged) 10jenkins-bot: Nokia: also allow anycast prefixes from Ganeti peers [homer/public] - 10https://gerrit.wikimedia.org/r/1294931 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [12:15:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T426633)', diff saved to https://phabricator.wikimedia.org/P93357 and previous config saved to /var/cache/conftool/dbconfig/20260528-121523-fceratto.json [12:15:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [12:15:38] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1294967 (owner: 10L10n-bot) [12:15:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [12:15:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T426633)', diff saved to https://phabricator.wikimedia.org/P93358 and previous config saved to /var/cache/conftool/dbconfig/20260528-121551-fceratto.json [12:18:48] (03PS7) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) [12:19:30] (03PS6) 10Clément Goubert: api-gateway: Pre-teardown deprecation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294957 (https://phabricator.wikimedia.org/T426881) [12:19:30] (03CR) 10DCausse: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [12:19:31] (03CR) 10Clément Goubert: ""Soft" deprecation of the `api-gateway` type deployments in favour of `rest-gateway`." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294957 (https://phabricator.wikimedia.org/T426881) (owner: 10Clément Goubert) [12:19:52] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1294974 (owner: 10L10n-bot) [12:23:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T426633)', diff saved to https://phabricator.wikimedia.org/P93359 and previous config saved to /var/cache/conftool/dbconfig/20260528-122349-fceratto.json [12:25:55] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1006.eqiad.wmnet with OS trixie [12:26:14] (03CR) 10Jelto: [C:03+2] miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [12:26:42] (03PS10) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [12:28:58] (03Merged) 10jenkins-bot: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [12:29:54] (03PS1) 10Muehlenhoff: mirrors: Disable osbpo sync [puppet] - 10https://gerrit.wikimedia.org/r/1294980 (https://phabricator.wikimedia.org/T416707) [12:30:26] (03CR) 10CI reject: [V:04-1] mirrors: Disable osbpo sync [puppet] - 10https://gerrit.wikimedia.org/r/1294980 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [12:30:48] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:33:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P93360 and previous config saved to /var/cache/conftool/dbconfig/20260528-123357-fceratto.json [12:33:59] (03CR) 10Elukey: [C:03+2] Remove pki-root1001 from site.pp [puppet] - 
10https://gerrit.wikimedia.org/r/1294942 (https://phabricator.wikimedia.org/T427487) (owner: 10Elukey) [12:36:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:37:20] (03CR) 10Brouberol: "Oh, I didn't see this patch and went ahead with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1294817 which is already m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294372 (https://phabricator.wikimedia.org/T418665) (owner: 10Dr0ptp4kt) [12:38:45] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [12:39:34] (03PS3) 10Papaul: Add new Eqsin subnet [puppet] - 10https://gerrit.wikimedia.org/r/1294487 (https://phabricator.wikimedia.org/T427393) [12:39:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:48] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [12:41:22] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (backup1013), Fresh: 135 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:43:36] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [12:44:02] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [12:44:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P93361 and previous config saved to 
/var/cache/conftool/dbconfig/20260528-124404-fceratto.json [12:48:00] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [12:48:23] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [12:50:17] (03CR) 10Brouberol: profile::kafka::broker: add ACLs in a file (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [12:51:51] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11963132 (10elukey) [12:54:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T426633)', diff saved to https://phabricator.wikimedia.org/P93362 and previous config saved to /var/cache/conftool/dbconfig/20260528-125412-fceratto.json [12:54:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [12:54:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T426633)', diff saved to https://phabricator.wikimedia.org/P93363 and previous config saved to /var/cache/conftool/dbconfig/20260528-125439-fceratto.json [12:54:53] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294984 (https://phabricator.wikimedia.org/T407341) [12:55:57] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294985 (https://phabricator.wikimedia.org/T407341) [12:58:32] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11963146 (10Papaul) @BCornwall hello can you please provide me with one CP node in rack 604 that i can use later on today to 
test the r... [12:59:17] (03PS1) 10Matthias Mullie: Image Carousel: check candidate pages [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294986 (https://phabricator.wikimedia.org/T427336) [12:59:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294986 (https://phabricator.wikimedia.org/T427336) (owner: 10Matthias Mullie) [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1300). [13:00:05] WMDE-Fisch and matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] o/ [13:00:10] o/ [13:00:16] \o I'll self serve [13:00:21] :-) [13:00:30] I’m around but fairly busy today so if someone else can deploy that would be great :) [13:00:31] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#11963150 (10Papaul) [13:00:37] WMDE-Fisch: go ahead :) [13:00:44] Looks like we're both self-servicing! 
[13:01:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by wmde-fisch@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:01:13] (03PS11) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [13:01:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T426633)', diff saved to https://phabricator.wikimedia.org/P93364 and previous config saved to /var/cache/conftool/dbconfig/20260528-130139-fceratto.json [13:03:03] 06SRE, 13Patch-For-Review: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11963164 (10elukey) Clean up on Jumbo: ` elukey@kafka-jumbo1010:~$ sudo -E kafka acls --remove --allow-principal User:ANONYMOUS --operation Read --operation Describe --allow-host 10.64.36.130 --topic webre... [13:03:07] (03PS1) 10Gkyziridis: wgRestSandboxSpecs: Add LiftWing API OpenAPI specs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294988 (https://phabricator.wikimedia.org/T426081) [13:04:00] (03PS5) 10Elukey: profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [13:04:04] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [13:05:04] (03CR) 10Elukey: [C:03+1] Switch the pki:root role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294958 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [13:05:33] (03PS2) 10Gkyziridis: wgRestSandboxSpecs: Add LiftWing API OpenAPI specs. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294988 (https://phabricator.wikimedia.org/T426081) [13:05:40] (03Abandoned) 10Bking: discovery: Replace soon-to-be-expired intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1259216 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [13:07:22] (03PS1) 10CDanis: cli: add --sort-groups and --reverse-sort options [software/cumin] - 10https://gerrit.wikimedia.org/r/1294990 [13:08:00] (03CR) 10WMDE-Fisch: [C:03+2] "resubmit unrelated failures" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:08:57] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:09:36] 06SRE, 13Patch-For-Review: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11963172 (10elukey) Removed also these varnishkafka-related ACLs belonging to old/test topics: ` elukey@kafka-jumbo1010:~$ sudo -E kafka acls --remove --allow-principal User:CN=varnishkafka --operation Des... 
[13:10:35] (03PS6) 10Elukey: profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [13:10:47] (03CR) 10WMDE-Fisch: Update VE core submodule to master (9cf5524e7) [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:11:16] (03CR) 10WMDE-Fisch: [C:03+2] "resubmit" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:11:30] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [13:11:33] Damn unrelated failures. try to speed things up here. [13:11:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P93365 and previous config saved to /var/cache/conftool/dbconfig/20260528-131147-fceratto.json [13:13:34] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11963174 (10Papaul) [13:13:44] 06SRE, 13Patch-For-Review: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11963176 (10elukey) Removed also the remaining varnishkafka ones, they are mentioning old topics not used anymore: ` elukey@kafka-jumbo1010:~$ sudo -E kafka acls --remove --allow-principal User:CN=varnishk... 
[13:13:44] (03CR) 10Ladsgroup: "It looks completely unused, I was a bit worried it might be used in wmcs but 1- they should use openstack's network policies honestly 2- I" [puppet] - 10https://gerrit.wikimedia.org/r/1294248 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:14:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader2005.wikimedia.org [13:14:12] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:15:10] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11963178 (10Papaul) [13:15:13] (03CR) 10CI reject: [V:04-1] Update VE core submodule to master (9cf5524e7) [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:15:41] 06SRE, 13Patch-For-Review: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11963179 (10elukey) Other ones: ` elukey@kafka-jumbo1010:~$ sudo -E kafka acls --remove --deny-principal User:ANONYMOUS --operation Write --topic atskafka_test_webrequest_text Root user detected, using the... [13:16:24] matthiasmullie: I don't have much time left to retrigger CI I need to skip my backport. [13:16:38] (03PS7) 10Elukey: profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [13:16:49] So feel free to go on with yours I'm out :-) [13:17:09] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [13:17:18] WMDE-Fisch: thanks, and good luck next time! 
:) [13:17:42] !log clean up a lot of stale Kafka ACLs on Kafka Jumbo - Details in T425528 [13:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:47] T425528: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528 [13:17:49] cc: brouberol --^ [13:18:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:19:23] (03CR) 10WMDE-Fisch: "Re-scheduled for Monday June 1st" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:19:35] (03CR) 10WMDE-Fisch: "recheck" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [13:19:56] RECOVERY - VRRP status on cr3-eqsin is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:20:05] jmm@cumin2002 makevm (PID 2047686) is awaiting input [13:20:11] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11963191 (10Papaul) [13:21:18] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [13:21:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294986 (https://phabricator.wikimedia.org/T427336) (owner: 10Matthias Mullie) [13:21:35] (03Abandoned) 10Gkyziridis: wgRestSandboxSpecs: Add LiftWing API OpenAPI specs [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [13:21:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P93366 and previous config saved to /var/cache/conftool/dbconfig/20260528-132155-fceratto.json [13:21:59] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [13:22:31] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [13:24:17] (03CR) 10Elukey: "If krb2002 goes down we probably cause impact to Hadoop and Presto, not ideal but if we keep the maintenance short for krb1002 it may be a" [puppet] - 10https://gerrit.wikimedia.org/r/1294952 (https://phabricator.wikimedia.org/T421706) (owner: 10Elukey) [13:29:03] (03Merged) 10jenkins-bot: Image Carousel: check candidate pages [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294986 (https://phabricator.wikimedia.org/T427336) (owner: 10Matthias Mullie) [13:29:19] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1294986|Image Carousel: check candidate pages (T427336)]] [13:29:24] T427336: Carousel: Limit the feature to article pages only - https://phabricator.wikimedia.org/T427336 [13:30:32] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [13:31:05] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1294986|Image Carousel: check candidate pages (T427336)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:31:47] !log mlitn@deploy1003 mlitn: Continuing with deployment [13:32:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T426633)', diff saved to https://phabricator.wikimedia.org/P93367 and previous config saved to /var/cache/conftool/dbconfig/20260528-133202-fceratto.json [13:32:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [13:32:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T426633)', diff saved to https://phabricator.wikimedia.org/P93368 and previous config saved to /var/cache/conftool/dbconfig/20260528-133230-fceratto.json [13:33:41] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [13:34:16] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [13:36:00] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294986|Image Carousel: check candidate pages (T427336)]] (duration: 06m 40s) [13:36:04] T427336: Carousel: Limit the feature to article pages only - https://phabricator.wikimedia.org/T427336 [13:37:56] Backport done; rest of this window available for whoever may need it [13:38:54] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [13:39:06] (03CR) 10TChin: [C:03+1] stream: webrequest_page_view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294947 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [13:39:11] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [13:39:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T426633)', diff saved to https://phabricator.wikimedia.org/P93369 and previous config saved to /var/cache/conftool/dbconfig/20260528-133936-fceratto.json [13:40:07] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply 
[13:40:28] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sat 13 Jun 2026 01:40:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [13:40:42] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [13:42:44] (03CR) 10JavierMonton: [C:03+2] stream: webrequest_page_view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294947 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [13:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:43:49] (03CR) 10TChin: [C:03+1] stream: webrequest_page_view (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294947 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [13:44:50] (03Merged) 10jenkins-bot: stream: webrequest_page_view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294947 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [13:46:23] (03PS1) 10Dreamy Jazz: ImageContentLookup: Fix issue created by strict types [extensions/MediaModeration] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294998 (https://phabricator.wikimedia.org/T427505) [13:47:26] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 24): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8607/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [13:49:06] (03PS1) 10Dreamy Jazz: Enable hCaptcha for VisualEditor in group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295001 (https://phabricator.wikimedia.org/T425940) [13:49:11] jouncebot: nowandnext [13:49:11] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1300) [13:49:11] In 0 hour(s) and 40 minute(s): Test 
Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1430) [13:49:19] Will use scap [13:49:21] (03CR) 10Harroyo-wmf: [C:03+1] ImageContentLookup: Fix issue created by strict types [extensions/MediaModeration] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294998 (https://phabricator.wikimedia.org/T427505) (owner: 10Dreamy Jazz) [13:49:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P93370 and previous config saved to /var/cache/conftool/dbconfig/20260528-134944-fceratto.json [13:50:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MediaModeration] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294998 (https://phabricator.wikimedia.org/T427505) (owner: 10Dreamy Jazz) [13:50:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295001 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [13:50:32] (03PS1) 10Jelto: miscweb: update wmf-navigator public config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295002 (https://phabricator.wikimedia.org/T414405) [13:51:25] (03Merged) 10jenkins-bot: Enable hCaptcha for VisualEditor in group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295001 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [13:52:42] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:46] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:53:04] (03CR) 10Muehlenhoff: 
[C:03+2] sre.cdn.roll-restart-reboot-ncredir: Fix aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1269402 (owner: 10Muehlenhoff) [13:53:58] (03PS4) 10Muehlenhoff: http-sso-django-login: Switch to firewall::service and restrict access [puppet] - 10https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) [13:54:12] (03CR) 10Muehlenhoff: [C:03+2] Blocklist more unused network protocols [puppet] - 10https://gerrit.wikimedia.org/r/1294272 (owner: 10Muehlenhoff) [13:55:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2005.wikimedia.org - jmm@cumin2002" [13:55:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2005.wikimedia.org - jmm@cumin2002" [13:55:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:55:14] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader2005.wikimedia.org on all recursors [13:55:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader2005.wikimedia.org on all recursors [13:55:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader2005.wikimedia.org - jmm@cumin2002" [13:55:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader2005.wikimedia.org - jmm@cumin2002" [13:56:38] PROBLEM - SSH on cp6015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:57:28] (03CR) 10Majavah: [C:03+1] http-sso-django-login: Switch to firewall::service and restrict access (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:57:28] RECOVERY - SSH on cp6015 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:57:42] uh? [13:58:03] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp6015.drmrs.wmnet,service=(cdn|ats-be) [13:58:36] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:58:55] jmm@cumin2002 makevm (PID 2047686) is awaiting input [13:59:32] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6015 is OK: HTTP OK: HTTP/1.1 200 OK - 50031 bytes in 0.359 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:59:36] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6015 is OK: HTTP OK: HTTP/1.0 200 OK - 36930 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:59:36] PROBLEM - statsv Varnishkafka log producer on cp6015 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:59:41] it's depooled, something is up [13:59:43] downtiming [13:59:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P93371 and previous config saved to /var/cache/conftool/dbconfig/20260528-135951-fceratto.json [14:00:21] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp6015.drmrs.wmnet with reason: hardware down [14:00:24] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11963403 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=c08bd354-2e81-45c8-96f3-c56a6ef9dd4c) set by sukhe@cumin1003 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hardware down `... [14:00:34] PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:00:36] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:02:27] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11963406 (10ssingh) 05Resolved→03Open a:05ssingh→03BCornwall The host is acting up again and is depooled. It seems like this time it is memory errors but racadm points to nothing. I am going to leave it depooled whi... [14:02:36] RECOVERY - statsv Varnishkafka log producer on cp6015 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:06:18] (03CR) 10Jelto: [C:03+2] miscweb: update wmf-navigator public config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295002 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:06:57] FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:07:10] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp1114 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:07:10] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp1114 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:07:10] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1114 is CRITICAL: SSL CRITICAL - failed to 
connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:07:25] !ack [14:07:26] 8029 (ACKED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [14:07:46] (03Merged) 10jenkins-bot: ImageContentLookup: Fix issue created by strict types [extensions/MediaModeration] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294998 (https://phabricator.wikimedia.org/T427505) (owner: 10Dreamy Jazz) [14:07:57] FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:02] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1294998|ImageContentLookup: Fix issue created by strict types (T427505)]], [[gerrit:1295001|Enable hCaptcha for VisualEditor in group 1 (T425940)]] [14:08:10] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1114 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-07-06 20:52:29 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:08:10] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp1114 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-07-12 03:51:38 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:08:10] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp1114 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-08-05 06:34:20 +0000 (expires in 68 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:08:10] T427505: TypeError: Wikimedia\Mime\MimeAnalyzer::getMimeTypeFromExtensionOrNull(): Argument #1 ($ext) must be of type string, false given, called in 
/srv/mediawiki/php-1.47.0-wmf.4/extensions/MediaModeration/src/Services/MediaModeration - https://phabricator.wikimedia.org/T427505 [14:08:11] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [14:08:40] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11963447 (10MLechvien-WMF) a:05Clement_Goubert→03None @jijiki are you handling that task too as part of {https://phabricator.wikimedia.org/T419976} ? [14:08:57] (03Merged) 10jenkins-bot: miscweb: update wmf-navigator public config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295002 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:09:06] o/ [14:09:36] the service is up but haproxy is logging this https://phabricator.wikimedia.org/P93372 [14:09:47] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1294998|ImageContentLookup: Fix issue created by strict types (T427505)]], [[gerrit:1295001|Enable hCaptcha for VisualEditor in group 1 (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:09:49] cdanis: ^ [14:09:56] I think we should tune it perhaps [14:10:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T426633)', diff saved to https://phabricator.wikimedia.org/P93373 and previous config saved to /var/cache/conftool/dbconfig/20260528-141001-fceratto.json [14:10:12] moving to -sec, just in case [14:10:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host urldownloader2005.wikimedia.org with OS trixie [14:10:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [14:10:28] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11963458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host urldownloader2005.wikimedia.org with OS trixie [14:10:29] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/1273790 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:10:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T426633)', diff saved to https://phabricator.wikimedia.org/P93374 and previous config saved to /var/cache/conftool/dbconfig/20260528-141029-fceratto.json [14:11:56] (03PS7) 10Muehlenhoff: profile::zookeeper::firewall: Also allow passing a list of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1272766 [14:11:57] RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:12:34] (03PS1) 10Majavah: P:puppetdb: Pull per-project Cumin hosts from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/1295003 (https://phabricator.wikimedia.org/T427518) [14:12:36] (03PS1) 10Majavah: 
P:openstack: cumin: target: Migrate to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1295004 (https://phabricator.wikimedia.org/T427518) [14:12:57] RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:13:05] !incidents [14:13:06] 8024 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [14:13:06] 8025 (ACKED) db2189 (paged)/MariaDB Replica IO: s2 (paged) [14:13:06] 8026 (ACKED) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [14:13:06] 8029 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [14:13:06] 8028 (RESOLVED) [2x] PyBalBGPUnstable lvs sre (lvs1017:9090 pybal 64600 eqiad) [14:13:26] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8609/co" [puppet] - 10https://gerrit.wikimedia.org/r/1295004 (https://phabricator.wikimedia.org/T427518) (owner: 10Majavah) [14:15:04] (03CR) 10Muehlenhoff: [C:03+2] Switch the orchestrator role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1240204 (owner: 10Muehlenhoff) [14:15:19] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [14:15:43] Apologies if my deploy is crashing into this next window (tests took a while to merge) [14:18:31] ACKNOWLEDGEMENT - Backup freshness on backup1014 is CRITICAL: All failures: 1 (backup1013), Fresh: 135 jobs Jcrespo expected - The acknowledgement expires at: 2026-05-29 14:18:17. 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:18:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T426633)', diff saved to https://phabricator.wikimedia.org/P93375 and previous config saved to /var/cache/conftool/dbconfig/20260528-141846-fceratto.json [14:19:28] (03PS1) 10CDanis: haproxy: warn-blocked-traffic-after++ [puppet] - 10https://gerrit.wikimedia.org/r/1295007 [14:19:32] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294998|ImageContentLookup: Fix issue created by strict types (T427505)]], [[gerrit:1295001|Enable hCaptcha for VisualEditor in group 1 (T425940)]] (duration: 11m 29s) [14:19:38] T427505: TypeError: Wikimedia\Mime\MimeAnalyzer::getMimeTypeFromExtensionOrNull(): Argument #1 ($ext) must be of type string, false given, called in /srv/mediawiki/php-1.47.0-wmf.4/extensions/MediaModeration/src/Services/MediaModeration - https://phabricator.wikimedia.org/T427505 [14:19:38] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [14:19:56] My deploy is done [14:20:32] 10SRE-tools, 10Ceph, 06cloud-services-team, 10Cloud-VPS, and 2 others: Enhancements to wmcs.ceph.roll_reboot_osds - https://phabricator.wikimedia.org/T427295#11963500 (10BLiviero-WMF) @Andrew do we understand _why_ this is happening? what research has been done to develop the understanding? 
[14:20:45] (03CR) 10Andrew Bogott: [C:03+1] P:puppetdb: Pull per-project Cumin hosts from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/1295003 (https://phabricator.wikimedia.org/T427518) (owner: 10Majavah) [14:21:04] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11963505 (10MoritzMuehlenhoff) [14:21:07] (03CR) 10Majavah: [C:03+2] P:puppetdb: Pull per-project Cumin hosts from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/1295003 (https://phabricator.wikimedia.org/T427518) (owner: 10Majavah) [14:21:15] (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1295007/8610/cp1100.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1295007 (owner: 10CDanis) [14:21:51] (03CR) 10Andrew Bogott: [C:03+1] P:openstack: cumin: target: Migrate to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1295004 (https://phabricator.wikimedia.org/T427518) (owner: 10Majavah) [14:22:00] (03CR) 10CDanis: [C:03+2] haproxy: warn-blocked-traffic-after++ [puppet] - 10https://gerrit.wikimedia.org/r/1295007 (owner: 10CDanis) [14:26:53] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: cumin: target: Migrate to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1295004 (https://phabricator.wikimedia.org/T427518) (owner: 10Majavah) [14:27:55] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:28:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader2005.wikimedia.org with reason: host reimage [14:28:19] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:28:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P93376 and previous config saved to /var/cache/conftool/dbconfig/20260528-142854-fceratto.json [14:29:12] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [14:29:25] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11963597 (10Gehel) [14:29:47] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake - https://phabricator.wikimedia.org/T427197#11963599 (10Gehel) [14:30:03] (03PS3) 10Trueg: dse-k8s-eqiad: Add wdqs namespace for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1430) [14:32:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader2005.wikimedia.org with reason: host reimage [14:33:50] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11963638 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @Marostegui all done. 
[14:35:20] !incidents [14:35:20] 8024 (UNACKED, 24h 00m old) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [14:35:20] 8025 (UNACKED, 24h 00m old) db2189 (paged)/MariaDB Replica IO: s2 (paged) [14:35:20] 8026 (UNACKED, 24h 00m old) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [14:35:21] 8029 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [14:35:21] 8028 (RESOLVED) [2x] PyBalBGPUnstable lvs sre (lvs1017:9090 pybal 64600 eqiad) [14:35:38] !ack [14:35:38] 8024 (ACKED, 24h 01m old) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [14:35:38] 8025 (ACKED, 24h 01m old) db2189 (paged)/MariaDB Replica IO: s2 (paged) [14:35:39] 8026 (ACKED, 24h 01m old) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [14:35:54] federico3: known issue? [14:36:18] yes, chasing it [14:36:55] I'm not sure why it's p.aging tho [14:37:09] ok, can I help? [14:38:05] no need, the host is up, pooled in, green on icinga [14:38:36] ok [14:39:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P93377 and previous config saved to /var/cache/conftool/dbconfig/20260528-143901-fceratto.json [14:39:22] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2189.codfw.wmnet [14:39:23] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2189.codfw.wmnet [14:41:13] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11963677 (10FCeratto-WMF) The host was shut down cleanly so I can check and repool it. [14:43:32] (03Abandoned) 10C. 
Scott Ananian: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280479 (owner: 10PipelineBot) [14:45:45] (03PS1) 10Ayounsi: Nokia: add missing Wikidough prefix [homer/public] - 10https://gerrit.wikimedia.org/r/1295011 (https://phabricator.wikimedia.org/T423384) [14:46:16] (03CR) 10Ssingh: [V:03+1] "1017 is not included in this list since we brought up the node yesterday and it hasn't made its way to PCC yet. Manually running PCC yield" [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [14:46:36] (03CR) 10Ssingh: [V:03+1] "Planning to deploy Mon June 1." [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [14:46:44] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11963699 (10FCeratto-WMF) 05Resolved→03In progress No error in the logs, replication is catching up. [14:46:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11963701 (10FCeratto-WMF) a:05Jhancock.wm→03FCeratto-WMF [14:47:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293805 (https://phabricator.wikimedia.org/T427331) (owner: 10Arlolra) [14:47:30] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:47:41] (03CR) 10Ssingh: [C:03+1] Nokia: add missing Wikidough prefix [homer/public] - 10https://gerrit.wikimedia.org/r/1295011 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [14:47:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host urldownloader2005.wikimedia.org with OS trixie [14:47:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host urldownloader2005.wikimedia.org [14:47:47] 06SRE, 06Infrastructure-Foundations: 
Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11963708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host urldownloader2005.wikimedia.org with OS trixie completed: - urldownloader2005 (**PASS**) - Rem... [14:47:58] (03PS1) 10Elukey: role::pki::multirootca: remove kafka_11 profile [puppet] - 10https://gerrit.wikimedia.org/r/1295012 [14:47:59] (03CR) 10Muehlenhoff: [C:03+2] http-sso-django-login: Switch to firewall::service and restrict access [puppet] - 10https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [14:48:01] (03CR) 10Ayounsi: [C:03+2] Nokia: add missing Wikidough prefix [homer/public] - 10https://gerrit.wikimedia.org/r/1295011 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [14:48:09] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:49:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T426633)', diff saved to https://phabricator.wikimedia.org/P93378 and previous config saved to /var/cache/conftool/dbconfig/20260528-144909-fceratto.json [14:49:10] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295012 (owner: 10Elukey) [14:49:12] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11963713 (10MoritzMuehlenhoff) [14:49:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [14:49:32] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [14:49:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T426633)', diff saved to https://phabricator.wikimedia.org/P93379 and previous config saved to /var/cache/conftool/dbconfig/20260528-144936-fceratto.json [14:49:42] (03Merged) 
10jenkins-bot: Nokia: add missing Wikidough prefix [homer/public] - 10https://gerrit.wikimedia.org/r/1295011 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [14:49:49] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [14:50:08] (03CR) 10CI reject: [V:04-1] role::pki::multirootca: remove kafka_11 profile [puppet] - 10https://gerrit.wikimedia.org/r/1295012 (owner: 10Elukey) [14:50:09] (03PS3) 10Dr0ptp4kt: Reactivate wikimedia.de email addresses for GrowthBook SSO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294372 (https://phabricator.wikimedia.org/T418665) [14:51:03] (03PS1) 10Elukey: role::pki::multirootca: remove kafka_11 profile [puppet] - 10https://gerrit.wikimedia.org/r/1295014 [14:51:07] (03PS1) 10Muehlenhoff: pontoon:lb: Restrict firewall services to CLOUD_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1295015 (https://phabricator.wikimedia.org/T149804) [14:51:35] (03Abandoned) 10Elukey: role::pki::multirootca: remove kafka_11 profile [puppet] - 10https://gerrit.wikimedia.org/r/1295012 (owner: 10Elukey) [14:51:45] (03Abandoned) 10Elukey: role::pki::multirootca: remove kafka_11 profile [puppet] - 10https://gerrit.wikimedia.org/r/1295014 (owner: 10Elukey) [14:56:29] !log installing nginx security updates [14:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T426633)', diff saved to https://phabricator.wikimedia.org/P93380 and previous config saved to /var/cache/conftool/dbconfig/20260528-145646-fceratto.json [14:58:46] (03PS1) 10Elukey: profile::cache: remove the kafka_11 PKI profile [puppet] - 10https://gerrit.wikimedia.org/r/1295020 [14:58:46] (03PS1) 10Elukey: profile::frtech::kafka_certificate: remove kafka_11 profile's occurrence [puppet] - 10https://gerrit.wikimedia.org/r/1295021 [14:58:46] (03PS1) 10Elukey: profile::kafka: remove kafka_11 profile occurrences 
[puppet] - 10https://gerrit.wikimedia.org/r/1295022 [14:58:47] (03PS1) 10Elukey: role::pki::multirootca: remove the Kafka kafka_11 profile [puppet] - 10https://gerrit.wikimedia.org/r/1295023 [15:00:05] jnuche and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1500) [15:00:29] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295020 (owner: 10Elukey) [15:00:36] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295021 (owner: 10Elukey) [15:00:43] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295022 (owner: 10Elukey) [15:00:51] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295023 (owner: 10Elukey) [15:06:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P93381 and previous config saved to /var/cache/conftool/dbconfig/20260528-150653-fceratto.json [15:12:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:13:26] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11963851 (10MoritzMuehlenhoff) [15:13:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:14:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:14:50] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:17:00] !log dmarc ingress test on mx-in1001 [15:17:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P93382 and previous config saved to /var/cache/conftool/dbconfig/20260528-151701-fceratto.json [15:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:48] (03CR) 10Ayounsi: [C:03+1] Add new Eqsin subnet [puppet] - 10https://gerrit.wikimedia.org/r/1294487 (https://phabricator.wikimedia.org/T427393) (owner: 10Papaul) [15:17:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:18:44] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5032.* [15:20:23] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5032.eqsin.wmnet with reason: Testing reimaging on new subnet [15:21:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11963925 (10BCornwall) Hi, @papaul: cp5032 is depooled/downtimed and ready for reimaging. 
[15:24:31] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11963940 (10Papaul) @BCornwall thank you will do that after lunch doing some onsite work [15:26:03] (03PS1) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295036 (https://phabricator.wikimedia.org/T425624) [15:27:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T426633)', diff saved to https://phabricator.wikimedia.org/P93383 and previous config saved to /var/cache/conftool/dbconfig/20260528-152708-fceratto.json [15:27:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [15:27:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T426633)', diff saved to https://phabricator.wikimedia.org/P93384 and previous config saved to /var/cache/conftool/dbconfig/20260528-152736-fceratto.json [15:31:31] (03CR) 10JavierMonton: [C:03+2] stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295036 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [15:31:59] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531 (10AudreyPenven_WMDE) 03NEW [15:32:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11964031 (10Jclark-ctr) @bking @RKemper can puppet be updated for new host names? 
[15:33:15] (03Merged) 10jenkins-bot: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295036 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [15:35:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T426633)', diff saved to https://phabricator.wikimedia.org/P93385 and previous config saved to /var/cache/conftool/dbconfig/20260528-153550-fceratto.json [15:44:25] (03CR) 10Filippo Giunchedi: [C:03+1] pontoon:lb: Restrict firewall services to CLOUD_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1295015 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [15:45:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P93386 and previous config saved to /var/cache/conftool/dbconfig/20260528-154557-fceratto.json [15:46:09] (03PS1) 10Alex.sanford: Add 2FA demotion config for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295039 [15:46:36] (03PS2) 10Alex.sanford: Add 2FA demotion config for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295039 (https://phabricator.wikimedia.org/T423119) [15:48:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295039 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [15:49:38] (03PS3) 10Alex.sanford: Add 2FA enforcement demotion config for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295039 (https://phabricator.wikimedia.org/T423119) [15:56:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P93387 and previous config saved to /var/cache/conftool/dbconfig/20260528-155605-fceratto.json [15:56:26] 
(03PS1) 10Aklapper: Phabricator automated emails: Update some Development Metrics URIs [puppet] - 10https://gerrit.wikimedia.org/r/1295040 [15:57:40] (03CR) 10Dzahn: "aha! first I thought this means Bitergia has been replaced but looks like stuff just got moved in-house to use. nice" [puppet] - 10https://gerrit.wikimedia.org/r/1295040 (owner: 10Aklapper) [16:00:05] jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:00:47] (03CR) 10Dzahn: [C:03+2] Phabricator automated emails: Update some Development Metrics URIs [puppet] - 10https://gerrit.wikimedia.org/r/1295040 (owner: 10Aklapper) [16:00:50] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:04:37] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294984 (https://phabricator.wikimedia.org/T407341) (owner: 10Santiago Faci) [16:05:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:05:13] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294985 (https://phabricator.wikimedia.org/T407341) (owner: 
10Santiago Faci) [16:05:50] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T426633)', diff saved to https://phabricator.wikimedia.org/P93388 and previous config saved to /var/cache/conftool/dbconfig/20260528-160613-fceratto.json [16:06:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1251.eqiad.wmnet with reason: Maintenance [16:06:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T426633)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260528-160646-fceratto.json [16:07:02] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294984 (https://phabricator.wikimedia.org/T407341) (owner: 10Santiago Faci) [16:07:37] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294985 (https://phabricator.wikimedia.org/T407341) (owner: 10Santiago Faci) [16:07:40] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:24] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:09:44] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:10:46] PROBLEM - Host db1224 #page is DOWN: PING 
CRITICAL - Packet loss = 100% [16:10:53] (03CR) 10Trueg: dse-k8s-eqiad: Add wdqs namespace for the new deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [16:11:01] !ack [16:11:02] 8030 (ACKED) Host db1224 (paged) [16:14:30] !log reprepro include php-excimer_1.2.5-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:36] T427312: Build PHP 8.3 packages for bookworm - https://phabricator.wikimedia.org/T427312 [16:14:50] !log reprepro include php-imagick_3.7.0-13+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:14:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T426633)', diff saved to https://phabricator.wikimedia.org/P93390 and previous config saved to /var/cache/conftool/dbconfig/20260528-161452-fceratto.json [16:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:05] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [16:15:05] !log reprepro include php-luasandbox_4.1.2-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:15] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [16:15:26] !log reprepro include php-memcached_3.3.0-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:48] !log reprepro include php-pcov_1.0.12-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:07] !log reprepro include php-redis_6.2.0-1+wmf12u1 into component/php83 for bookworm-wikimedia - 
T427312 [16:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:24] !log reprepro include php-uuid_1.3.0-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:44] !log reprepro include php-wmerrors_2.0.0-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:56] !log reprepro include php-xhprof_2.3.10-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:11] !log reprepro include php-yaml_2.2.4-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:27] !log reprepro include wikidiff2_1.14.1-2+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:41] !log reprepro include xdebug_3.4.4-1+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [16:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:54] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1294392/8611/contint1002.wikimedia.org/change.contint1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 (owner: 10Dzahn) [16:19:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:58] 10ops-eqiad, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535 (10FCeratto-WMF) 03NEW [16:20:05] 10ops-eqiad, 06DBA, 
06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964240 (10FCeratto-WMF) getsel: ` ------------------------------------------------------------------------------- Record: 57 Date/Time: 05/28/2026 15:06:07 Source: system Severity: Ok Description:... [16:20:38] (03PS3) 10Dzahn: CI: better naming; avoid using terms "new" and "legacy" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 [16:21:03] 10ops-eqiad, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964242 (10FCeratto-WMF) Dashboard: https://grafana.wikimedia.org/goto/cfnfwwukbq0hsd?orgId=1 [16:21:25] 10ops-eqiad, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964243 (10FCeratto-WMF) p:05Triage→03High [16:22:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 99 days, 0:00:00 on db1224.eqiad.wmnet with reason: unreachable T427535 [16:22:51] T427535: db1224 is unreachable - https://phabricator.wikimedia.org/T427535 [16:24:31] 10ops-eqiad, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964254 (10FCeratto-WMF) [16:25:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P93391 and previous config saved to /var/cache/conftool/dbconfig/20260528-162459-fceratto.json [16:27:32] (03CR) 10Dzahn: [V:03+1] "Antoine, is this better naming?" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 (owner: 10Dzahn) [16:29:01] (03PS4) 10Dzahn: CI: better naming; avoid using terms "new" and "legacy" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 [16:33:04] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on mc-gp2005:9290 - https://phabricator.wikimedia.org/T427410#11964297 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm PSU replaced. alert cleared. 
returning damaged PSU [16:35:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P93392 and previous config saved to /var/cache/conftool/dbconfig/20260528-163507-fceratto.json [16:36:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in ms-be2089 - https://phabricator.wikimedia.org/T427266#11964330 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm stock replacement arrived. [16:36:29] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:36:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:29] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:37:35] RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [16:37:37] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [16:39:35] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [16:39:35] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [16:45:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T426633)', diff saved to https://phabricator.wikimedia.org/P93393 and previous config saved to /var/cache/conftool/dbconfig/20260528-164514-fceratto.json [16:55:37] (03PS1) 10Scott French: php8.3: Rebuild 8.3 image stack on bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295044 [16:57:24] (03CR) 10Scott French: [V:03+2] "Built locally: https://phabricator.wikimedia.org/T427312#11964352" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295044 (owner: 10Scott French) [16:57:38] (03PS1) 10Dr0ptp4kt: WIP DNM: Add commonswiki globalimagelinks monthly sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) [16:58:31] (03PS1) 10Dr0ptp4kt: WIP DNM: Add filerevision to the mediawiki not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) [16:58:44] (03CR) 10Scott French: [V:03+2 C:04-1] "Holding onto this until we're ready to proceed with the first migration (Shellbox)." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295044 (owner: 10Scott French) [16:59:05] (03CR) 10Catrope: [C:03+1] Add 2FA enforcement demotion config for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295039 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [16:59:42] (03CR) 10CI reject: [V:04-1] WIP DNM: Add commonswiki globalimagelinks monthly sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [17:00:05] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T1700) [17:00:41] (03CR) 10CI reject: [V:04-1] WIP DNM: Add filerevision to the mediawiki not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [17:07:31] * bd808 looks to see what is geployable [17:07:40] *deployable [17:09:13] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:10:31] (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-05-28-121950-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295048 [17:13:41] (03CR) 10Dzahn: "Generally I am thinking:" [puppet] - 10https://gerrit.wikimedia.org/r/1178874 (https://phabricator.wikimedia.org/T378028) (owner: 10AOkoth) [17:13:51] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2026-05-28-121950-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295048 (owner: 10BryanDavis) [17:15:39] (03CR) 10JHathaway: [C:03+1] firewall::client: Fix default for qos [puppet] - 10https://gerrit.wikimedia.org/r/1294948 (owner: 10Majavah) [17:15:55] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2026-05-28-121950-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295048 (owner: 10BryanDavis) [17:16:51] (03CR) 10JHathaway: [C:03+1] P:sre::nftables_compat_check: Install python3-pypuppetdb [puppet] - 10https://gerrit.wikimedia.org/r/1294951 (owner: 10Majavah) [17:18:00] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:18:16] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:18:24] !log bd808@deploy1003 helmfile [codfw] START 
helmfile.d/services/developer-portal: apply [17:18:46] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:18:53] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:19:22] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:19:54] That's my deployment window sorted. [17:24:37] (03CR) 10Dzahn: "separate from my general comments, a specific issue with the current PS is here:" [puppet] - 10https://gerrit.wikimedia.org/r/1178874 (https://phabricator.wikimedia.org/T378028) (owner: 10AOkoth) [17:29:50] (03CR) 10CDanis: [C:03+1] Refactor the backend regex in ATSBackendErrorsHigh (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1293839 (owner: 10RLazarus) [17:29:57] (03PS6) 10Andrew Bogott: designate: remove leftover mcrouter code [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) [17:30:17] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) (owner: 10Andrew Bogott) [17:31:10] (03CR) 10RLazarus: [C:03+2] Refactor the backend regex in ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1293839 (owner: 10RLazarus) [17:31:25] (03CR) 10RLazarus: [C:03+2] Refactor the backend regex in ATSBackendErrorsHigh (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1293839 (owner: 10RLazarus) [17:32:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11964544 (10BCornwall) @Jclark-ctr yeah, go for it. Thanks! 
[17:32:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11964545 (10Jclark-ctr) a:05BCornwall→03Jclark-ctr [17:33:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11964548 (10Jclark-ctr) A7 U27 [17:33:35] (03Merged) 10jenkins-bot: Refactor the backend regex in ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1293839 (owner: 10RLazarus) [17:35:04] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964558 (10VRiley-WMF) a:03VRiley-WMF [17:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:53:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964662 (10VRiley-WMF) @FCeratto-WMF This unit shoud be reachable again. iDRAC scans hardware during boot. If it detects an error Dell classifies as uncorrectable, iDRAC will not rescan the device until... 
[17:56:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [17:57:57] !log Stopping pybal/puppet/downtiming lvs2013.codfw.wmnet for reboot [17:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:01] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: Kernel reboot [18:02:25] RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:03:52] (03PS2) 10Scott French: P:mediawiki::php: Support debian bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1295057 [18:06:31] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet [18:09:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet [18:09:31] PROBLEM - Host lvs2013 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:46] ^expected, ignore [18:10:05] (03PS1) 10CDanis: cache: contact_info: test Pywikibot format [puppet] - 10https://gerrit.wikimedia.org/r/1295060 (https://phabricator.wikimedia.org/T427491) [18:10:07] (03PS1) 10CDanis: cache: contact_info_text: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1295061 [18:10:15] RECOVERY - Host lvs2013 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [18:10:29] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:10:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:10:45] ^expected, ignore 
[18:13:01] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:13:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:13:32] (03PS2) 10CDanis: cache: contact_info: test Pywikibot format [puppet] - 10https://gerrit.wikimedia.org/r/1295060 (https://phabricator.wikimedia.org/T427491) [18:13:32] (03PS2) 10CDanis: cache: contact_info_text: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1295061 [18:16:01] (03CR) 10Papaul: [C:03+2] Add new Eqsin subnet [puppet] - 10https://gerrit.wikimedia.org/r/1294487 (https://phabricator.wikimedia.org/T427393) (owner: 10Papaul) [18:17:47] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964743 (10VRiley-WMF) @FCeratto-WMF If you'd like, I can also update the firmware before it's fully handed over as well? either way, let me know [18:19:17] !log Stopping pybal/puppet/downtiming lvs2011.codfw.wmnet for reboot [18:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:52] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: Kernel reboot [18:24:52] (03CR) 10Giuseppe Lavagetto: [C:03+1] cache: contact_info_text: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1295061 (owner: 10CDanis) [18:25:12] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T427553 (10APDube-WMF) 03NEW [18:25:33] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2011.codfw.wmnet [18:26:19] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T427553#11964770 (10APDube-WMF) @ccasilli Could you please provide approval for the request? 
[18:28:32] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2011.codfw.wmnet [18:28:40] (03CR) 10Bking: "Thanks Ryan! PCC looks good, but will wait for @cwhite@wikimedia.org 's +1 before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [18:29:13] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:29:19] ^expected, ignore [18:29:29] PROBLEM - pybal on lvs2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:29:35] nice thanks [18:30:05] PROBLEM - PyBal connections to etcd on lvs2011 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [18:30:05] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2011 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:31:05] RECOVERY - pybal on lvs2011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:31:05] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:31:17] RECOVERY - PyBal connections to etcd on lvs2011 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [18:31:17] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2011 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:33:55] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:34:42] !log Stopping pybal/puppet/downtiming lvs1019.eqiad.wmnet for reboot and BIOS update/memory self-healing - T426109 [18:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:48] T426109: Reboot lvs1019 for memory self-healing - https://phabricator.wikimedia.org/T426109 [18:35:09] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1019.eqiad.wmnet with reason: Kernel reboot [18:35:54] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:40:50] !log planet1003/planet2003 - apt-get upgrade - all pending package upgrades [18:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T427487#11964810 (10VRiley-WMF) [18:43:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T427487#11964814 (10VRiley-WMF) 05Open→03Resolved This is completed [18:46:05] (03PS3) 10CDanis: cache: contact_info: test Pywikibot format [puppet] - 10https://gerrit.wikimedia.org/r/1295060 (https://phabricator.wikimedia.org/T427491) [18:46:05] (03PS3) 10CDanis: cache: contact_info_text: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1295061 [18:46:05] (03PS2) 10CDanis: 
cache::haproxy: limit email addresses to reasonable lengths [puppet] - 10https://gerrit.wikimedia.org/r/1240174 (owner: 10Giuseppe Lavagetto) [18:47:09] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:48:18] (03CR) 10CDanis: cache::haproxy: limit email addresses to reasonable lengths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240174 (owner: 10Giuseppe Lavagetto) [18:48:19] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964825 (10FCeratto-WMF) @VRiley-WMF the host is not responding on ssh and not generating metrics so maybe it did not power up. Please update the firmware and tomorrow I'll try to powercycle it. [18:48:52] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11964829 (10Ahoelzl) Approved. [18:49:08] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake - https://phabricator.wikimedia.org/T427197#11964831 (10Ahoelzl) Approved. 
[18:50:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/32 (Transport: lvs1019:enp94s0f0np0 (Equinix, 21989994) {#20220411}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:51:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change cp5032 IP - pt1979@cumin2002" [18:51:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change cp5032 IP - pt1979@cumin2002" [18:51:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS trixie [18:52:30] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11964838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS... 
[18:53:31] !incidents [18:53:31] 8024 (ACKED, 28h 18m old) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [18:53:32] 8025 (ACKED, 28h 18m old) db2189 (paged)/MariaDB Replica IO: s2 (paged) [18:53:32] 8026 (ACKED, 28h 18m old) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [18:53:32] 8030 (ACKED) Host db1224 (paged) [18:53:32] 8029 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [18:55:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/32 (Transport: lvs1019:enp94s0f0np0 (Equinix, 21989994) {#20220411}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:03:11] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T427553#11964870 (10ccasilli) Approved! 
{F84950579} [19:05:08] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1019.eqiad.wmnet [19:05:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1019.eqiad.wmnet [19:07:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11964876 (10VRiley-WMF) Understood, I'll continue to look into this [19:09:13] !log Stopping pybal/puppet/downtiming lvs1018.eqiad.wmnet for reboot [19:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:44] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: Kernel reboot [19:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:13:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:18:24] (03CR) 10Gmodena: wdqs-backend: Deployment chart for the WDQS triple-store (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [19:23:16] 
(03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.24.0-a6 [vendor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295066 (https://phabricator.wikimedia.org/T420336) [19:23:30] if anyone happens to see this -- I was going to deploy Santi's config patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1285412 in the upcoming deployment window but have to run out -- it can ride along with other config patches [19:23:36] (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.24.0-a6 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295067 (https://phabricator.wikimedia.org/T427082) [19:25:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295067 (https://phabricator.wikimedia.org/T427082) (owner: 10Arlolra) [19:27:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912#11964942 (10BCornwall) @ssingh @BBlack Okay with me switching write-back to write-through slowly through the codfw cluster or shall we leave this as-is until the refresh? [19:27:56] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1018.eqiad.wmnet [19:27:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1018.eqiad.wmnet [19:30:13] (03CR) 10Gmodena: dse-k8s-eqiad: Add wdqs namespace for the new deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [19:30:58] (03CR) 10BCornwall: [C:03+1] cache: contact_info: test Pywikibot format [puppet] - 10https://gerrit.wikimedia.org/r/1295060 (https://phabricator.wikimedia.org/T427491) (owner: 10CDanis) [19:33:30] (03CR) 10BCornwall: [C:03+1] "How were tests passing before?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1295061 (owner: 10CDanis) [19:35:40] (03CR) 10CDanis: [C:03+2] cache: contact_info: test Pywikibot format [puppet] - 10https://gerrit.wikimedia.org/r/1295060 (https://phabricator.wikimedia.org/T427491) (owner: 10CDanis) [19:35:43] (03CR) 10CDanis: [C:03+2] "`txn.req_contact_info`, a non-existent entry in the table, is indeed also equal to `nil` 🙃" [puppet] - 10https://gerrit.wikimedia.org/r/1295061 (owner: 10CDanis) [19:41:47] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T427553#11964998 (10Aklapper) @ccasilli The file which you posted is not visible as it is not attached to this task; see https://www.mediawiki.org/wiki/Phabricator/Help#Uploading_file_attachments... [19:42:26] (03PS1) 10Bking: dse-k8s: Create kubeconfigs for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/1295068 (https://phabricator.wikimedia.org/T425007) [19:44:32] (03CR) 10BCornwall: [C:03+1] "Getting the spidey sense that the "parts" section of the email could be optimized a little further but it looks sound to me as-is." [puppet] - 10https://gerrit.wikimedia.org/r/1240174 (owner: 10Giuseppe Lavagetto) [19:46:11] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T427553#11965007 (10Aklapper) @APDube-WMF: Hi and welcome! Please also [link your Developer account (=LDAP) to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/)... 
[19:50:33] (03PS2) 10Dr0ptp4kt: WIP DNM: Add commonswiki globalimagelinks monthly sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) [19:51:15] (03PS2) 10Dr0ptp4kt: WIP DNM: Add filerevision to the mediawiki not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) [19:55:13] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11965021 (10BCornwall) @ssingh What made you suspect mem errors? I see from the previous boot that OOM kept getting invoked on purged. [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T2000). Please do the needful. [20:00:05] Dreamy_Jazz, sfaci, RoanKattouw, Tran, arlolra, and alexsanford: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] o/ [20:00:11] hey [20:00:19] o/ [20:00:30] Unfortunately I am not able to deploy today (and cjming isn't either). Would someone else be able to take these? [20:01:10] I can do some deploys [20:01:47] I can self-deploy, can help with other deploys if they can be deployed via spiderpig [20:02:04] I can also self-deploy, or happy to have mine bundled in with other config changes [20:02:52] well we should get started. Dreamy_Jazz are you around? [20:05:23] doesn't look like it [20:05:30] sfaci? [20:05:48] Tran maybe get started with yours, RoanKattouw and alexsanford ? 
[20:06:02] Forgot I had an item in this window 😄 [20:06:04] quoting from earlier fyi: 19:23:30 if anyone happens to see this -- I was going to deploy Santi's config patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1285412 in the upcoming deployment window but have to run out -- it can ride along with other config patches [20:06:23] I can wait for others to go first [20:07:16] Dreamy_Jazz: I'm deploying a bunch of configs, can yours go with them? [20:07:48] Yeah they can [20:07:52] If you want [20:08:08] My patch should be a no op [20:08:13] In terms of functionality [20:09:02] k I'm going to do mine, Dreamy_Jazz, RoanKattouw, and alexsanford [20:09:30] Thanks [20:09:32] thx [20:10:06] Sounds good, mine is low risk and can be bundled [20:11:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291996 (https://phabricator.wikimedia.org/T426981) (owner: 10Dreamy Jazz) [20:11:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294393 (owner: 10Catrope) [20:11:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294229 (https://phabricator.wikimedia.org/T427369) (owner: 10STran) [20:11:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295039 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [20:12:41] (03Merged) 10jenkins-bot: Replace deprecated Hooks::getInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291996 (https://phabricator.wikimedia.org/T426981) (owner: 10Dreamy Jazz) [20:12:45] (03Merged) 10jenkins-bot: Permissions: Create wmf-officeit group on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294393 (owner: 10Catrope) [20:12:47] 
(03Merged) 10jenkins-bot: Deploy IRS Direct Reporting feature to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294229 (https://phabricator.wikimedia.org/T427369) (owner: 10STran) [20:12:51] (03Merged) 10jenkins-bot: Add 2FA enforcement demotion config for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295039 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [20:13:08] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1291996|Replace deprecated Hooks::getInstance (T426981)]], [[gerrit:1294393|Permissions: Create wmf-officeit group on officewiki]], [[gerrit:1294229|Deploy IRS Direct Reporting feature to enwiki (T427369)]], [[gerrit:1295039|Add 2FA enforcement demotion config for phase 2 groups (T423119)]] [20:13:16] T426981: Replace uses of deprecated Hooks methods - https://phabricator.wikimedia.org/T426981 [20:13:17] T427369: Deploy direct reporting to enwiki - https://phabricator.wikimedia.org/T427369 [20:13:17] T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119 [20:13:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5032.eqsin.wmnet with OS trixie [20:14:02] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11965068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS trix... 
[20:14:53] !log stran@deploy1003 alexsanford, stran, catrope, dreamyjazz: Backport for [[gerrit:1291996|Replace deprecated Hooks::getInstance (T426981)]], [[gerrit:1294393|Permissions: Create wmf-officeit group on officewiki]], [[gerrit:1294229|Deploy IRS Direct Reporting feature to enwiki (T427369)]], [[gerrit:1295039|Add 2FA enforcement demotion config for phase 2 groups (T423119)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:15:44] Dreamy_Jazz, RoanKattouw, and alexsanford if you have testing to do for yours, now's the time [20:15:50] on it [20:17:24] Mine is good to go [20:17:56] Same, I think everyone else is a no-op so I'm moving forward. [20:18:02] !log stran@deploy1003 alexsanford, stran, catrope, dreamyjazz: Continuing with deployment [20:22:16] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1291996|Replace deprecated Hooks::getInstance (T426981)]], [[gerrit:1294393|Permissions: Create wmf-officeit group on officewiki]], [[gerrit:1294229|Deploy IRS Direct Reporting feature to enwiki (T427369)]], [[gerrit:1295039|Add 2FA enforcement demotion config for phase 2 groups (T423119)]] (duration: 09m 07s) [20:22:23] T426981: Replace uses of deprecated Hooks methods - https://phabricator.wikimedia.org/T426981 [20:22:23] T427369: Deploy direct reporting to enwiki - https://phabricator.wikimedia.org/T427369 [20:22:24] T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119 [20:23:09] arlolra: I'm done. Haven't heard from sfaci so I think you're good to start. [20:23:34] I can get started rzl left a message about that one.
I'll combine it with my config [20:26:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293805 (https://phabricator.wikimedia.org/T427331) (owner: 10Arlolra) [20:26:37] Actually, nevermind, it says that patch isn't rollback safe [20:27:18] (03Merged) 10jenkins-bot: Deploy PRV to 7 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293805 (https://phabricator.wikimedia.org/T427331) (owner: 10Arlolra) [20:27:34] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1293805|Deploy PRV to 7 wikis (T427331)]] [20:27:38] T427331: Parsoid Read Views to deploy ~2026-05-28 - https://phabricator.wikimedia.org/T427331 [20:29:26] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1293805|Deploy PRV to 7 wikis (T427331)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:30:45] !log arlolra@deploy1003 arlolra: Continuing with deployment [20:30:49] (03PS3) 10Dr0ptp4kt: Add filerevision to the mediawiki not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) [20:31:49] (03PS3) 10Dr0ptp4kt: Add commonswiki globalimagelinks monthly sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) [20:34:54] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293805|Deploy PRV to 7 wikis (T427331)]] (duration: 07m 20s) [20:34:59] T427331: Parsoid Read Views to deploy ~2026-05-28 - https://phabricator.wikimedia.org/T427331 [20:36:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295066 (https://phabricator.wikimedia.org/T420336) (owner: 10Arlolra) [20:36:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1295067 (https://phabricator.wikimedia.org/T427082) (owner: 10Arlolra) [20:38:00] RECOVERY - Host wdqs1015 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:40:04] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a6 [vendor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295066 (https://phabricator.wikimedia.org/T420336) (owner: 10Arlolra) [20:40:55] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a6 [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295067 (https://phabricator.wikimedia.org/T427082) (owner: 10Arlolra) [20:41:14] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1295066|Bump wikimedia/parsoid to 0.24.0-a6 (T420336 T427098 T427354 T427082)]], [[gerrit:1295067|Bump wikimedia/parsoid to 0.24.0-a6 (T427082)]] [20:41:24] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [20:41:24] T427098: mw:Param meta tags can sometimes leak through unprocessed - https://phabricator.wikimedia.org/T427098 [20:41:25] T427354: PHP Deprecated: Use of Wikimedia\Parsoid\Core\DomPageBundle::toDom without SiteConfig was deprecated in Parsoid 0.24. [Called from Wikimedia\Parsoid\Utils\ComputeSelectiveStats::classify] - https://phabricator.wikimedia.org/T427354 [20:41:25] T427082: CTT tasks week of 2026-05-22 - https://phabricator.wikimedia.org/T427082 [20:43:02] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1295066|Bump wikimedia/parsoid to 0.24.0-a6 (T420336 T427098 T427354 T427082)]], [[gerrit:1295067|Bump wikimedia/parsoid to 0.24.0-a6 (T427082)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:44:33] !log arlolra@deploy1003 arlolra: Continuing with deployment [20:47:47] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:48:48] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295066|Bump wikimedia/parsoid to 0.24.0-a6 (T420336 T427098 T427354 T427082)]], [[gerrit:1295067|Bump wikimedia/parsoid to 0.24.0-a6 (T427082)]] (duration: 07m 34s) [20:48:56] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [20:48:57] T427098: mw:Param meta tags can sometimes leak through unprocessed - https://phabricator.wikimedia.org/T427098 [20:48:57] T427354: PHP Deprecated: Use of Wikimedia\Parsoid\Core\DomPageBundle::toDom without SiteConfig was deprecated in Parsoid 0.24. 
[Called from Wikimedia\Parsoid\Utils\ComputeSelectiveStats::classify] - https://phabricator.wikimedia.org/T427354 [20:48:57] T427082: CTT tasks week of 2026-05-22 - https://phabricator.wikimedia.org/T427082 [20:49:35] And that concludes the backport window [21:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T2100) [21:04:06] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "setup new eqsin vlan - pt1979@cumin2002 - T427393" [21:04:11] T427393: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393 [21:04:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "setup new eqsin vlan - pt1979@cumin2002 - T427393" [21:07:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp5032.eqsin.wmnet [21:09:13] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:10:44] pt1979@cumin2002 dhcp (PID 2219777) is awaiting input [21:11:42] Deploying security patch for T426889 [21:21:22] !log Deployed security fix for T426889 [21:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:19] Deploying security patch for T426867 [21:25:10] (03PS1) 10Bking: wdqs-alternatives: Install 's3cmd' package [puppet] - 10https://gerrit.wikimedia.org/r/1295077 [21:25:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295077 (owner: 10Bking) [21:25:50] scap is running [21:26:16] (03PS1) 10JHathaway: WIP: rowlf-pp [puppet] - 10https://gerrit.wikimedia.org/r/1295078 [21:26:54] (03CR) 10CI reject: [V:04-1] WIP: rowlf-pp [puppet] - 10https://gerrit.wikimedia.org/r/1295078 (owner: 10JHathaway) 
[21:33:40] !log Deployed security fix for T426867 [21:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:25] (03CR) 10Bking: [C:03+2] wdqs-alternatives: Install 's3cmd' package [puppet] - 10https://gerrit.wikimedia.org/r/1295077 (owner: 10Bking) [21:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:57:41] (03PS4) 10Cwhite: OpenSearch: Add required config for bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [22:00:23] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [22:07:55] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [22:24:08] jouncebot: nowandnext [22:24:08] No deployments scheduled for the next 7 hour(s) and 35 minute(s) [22:24:08] In 7 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260529T0600) [22:24:23] (03CR) 10Cwhite: [C:03+1] performance.w.o: restrict blackbox check to ip4 [puppet] - 10https://gerrit.wikimedia.org/r/1293091 (https://phabricator.wikimedia.org/T425299) (owner: 10Tiziano Fogli) [22:31:01] !log dreamyjazz Deployed security patch for T426388 [22:34:05] !log reprepro includedeb trixie-wikimedia /home/andrew/magnum-cluster-api_0.36.6-1~wmf13u2_amd64.deb [22:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:24] (03PS2) 10Komla Sapaty: profile::toolforge::bastion: add SSH login activity export timer [puppet] - 10https://gerrit.wikimedia.org/r/1294864 [23:02:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:07:13] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ae1.522 interface - pt1979@cumin2002" [23:07:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ae1.522 interface - pt1979@cumin2002" [23:07:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:09:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:13:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:36:43] (03PS1) 10Santiago Faci: test-kitchen: Update chart to add a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295092 (https://phabricator.wikimedia.org/T421803) [23:39:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1295093 [23:39:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1295093 (owner: 10TrainBranchBot) 
[23:44:58] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295094 (https://phabricator.wikimedia.org/T421803) [23:50:05] (03PS2) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295094 (https://phabricator.wikimedia.org/T421803) [23:53:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1295093 (owner: 10TrainBranchBot) [23:53:49] (03PS3) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295094 (https://phabricator.wikimedia.org/T421803)