[00:00:34] (CR) BCornwall: "Please update the commit message" [puppet] - https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[00:00:45] (CR) BCornwall: "Marking unresolved" [puppet] - https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[00:04:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:05:25] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:07] (PS1) Dzahn: contint: in httpd include proxy configs individually, not by wildcard [puppet] - https://gerrit.wikimedia.org/r/1306445 (https://phabricator.wikimedia.org/T418521)
[00:06:59] (CR) Dzahn: "this is a noop:" [puppet] - https://gerrit.wikimedia.org/r/1306445 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:08:37] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[00:08:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T410589)', diff saved to https://phabricator.wikimedia.org/P94549 and previous config saved to /var/cache/conftool/dbconfig/20260630-000844-ladsgroup.json
[00:08:49] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[00:14:01] (PS1) Dzahn: contint: stop including local jenkins proxy, switch ext proxy to /ci [puppet] - https://gerrit.wikimedia.org/r/1306446 (https://phabricator.wikimedia.org/T418521)
[00:15:03] (PS2) Dzahn: contint: stop including local jenkins proxy, switch ext proxy to /ci [puppet] - https://gerrit.wikimedia.org/r/1306446 (https://phabricator.wikimedia.org/T418521)
[00:16:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410589)', diff saved to https://phabricator.wikimedia.org/P94550 and previous config saved to /var/cache/conftool/dbconfig/20260630-001634-ladsgroup.json
[00:16:39] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[00:17:02] (CR) Ladsgroup: "I go bother my teammates tomorrow so each one take care of their own stuff!" [puppet] - https://gerrit.wikimedia.org/r/1305988 (https://phabricator.wikimedia.org/T372666) (owner: JHathaway)
[00:18:03] (PS3) Dzahn: contint: stop including local jenkins proxy, switch ext proxy to /ci [puppet] - https://gerrit.wikimedia.org/r/1306446 (https://phabricator.wikimedia.org/T418521)
[00:20:01] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1306446/8812/contint1002.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1306446 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:26:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P94551 and previous config saved to /var/cache/conftool/dbconfig/20260630-002642-ladsgroup.json
[00:32:07] ops-codfw, SRE, Data-Persistence, Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12068530 (Papaul) @BCornwall about 40 minutes from adding the image to reboot
[00:36:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P94552 and previous config saved to /var/cache/conftool/dbconfig/20260630-003650-ladsgroup.json
[00:46:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410589)', diff saved to https://phabricator.wikimedia.org/P94553 and previous config saved to /var/cache/conftool/dbconfig/20260630-004657-ladsgroup.json
[00:47:03] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[00:47:14] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[00:47:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T410589)', diff saved to https://phabricator.wikimedia.org/P94554 and previous config saved to /var/cache/conftool/dbconfig/20260630-004721-ladsgroup.json
[00:53:25] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2063.codfw.wmnet with OS trixie
[00:55:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410589)', diff saved to https://phabricator.wikimedia.org/P94555 and previous config saved to /var/cache/conftool/dbconfig/20260630-005517-ladsgroup.json
[00:55:23] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:05:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P94556 and previous config saved to /var/cache/conftool/dbconfig/20260630-010525-ladsgroup.json
[01:07:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:11:14] (PS1) TrainBranchBot: Branch commit for wmf/1.47.0-wmf.9 [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1306451 (https://phabricator.wikimedia.org/T423918)
[01:11:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.47.0-wmf.9 [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1306451 (https://phabricator.wikimedia.org/T423918) (owner: TrainBranchBot)
[01:12:01] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2063.codfw.wmnet with reason: host reimage
[01:12:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306452
[01:12:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306452 (owner: TrainBranchBot)
[01:15:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P94557 and previous config saved to /var/cache/conftool/dbconfig/20260630-011533-ladsgroup.json
[01:18:19] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2063.codfw.wmnet with reason: host reimage
[01:19:08] (Merged) jenkins-bot: Branch commit for wmf/1.47.0-wmf.9 [core] (wmf/1.47.0-wmf.9) - https://gerrit.wikimedia.org/r/1306451 (https://phabricator.wikimedia.org/T423918) (owner: TrainBranchBot)
[01:20:24] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306452 (owner: TrainBranchBot)
[01:25:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410589)', diff saved to https://phabricator.wikimedia.org/P94558 and previous config saved to /var/cache/conftool/dbconfig/20260630-012540-ladsgroup.json
[01:25:46] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:25:58] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance
[01:26:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T410589)', diff saved to https://phabricator.wikimedia.org/P94559 and previous config saved to /var/cache/conftool/dbconfig/20260630-012605-ladsgroup.json
[01:33:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410589)', diff saved to https://phabricator.wikimedia.org/P94560 and previous config saved to /var/cache/conftool/dbconfig/20260630-013329-ladsgroup.json
[01:33:34] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:37:42] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2063.codfw.wmnet with OS trixie
[01:43:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P94561 and previous config saved to /var/cache/conftool/dbconfig/20260630-014337-ladsgroup.json
[01:44:42] !log jasmine@cumin2002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker1163.eqiad.wmnet
[01:44:47] !log jasmine@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1163.eqiad.wmnet
[01:45:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1163.eqiad.wmnet
[01:45:59] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker1163.eqiad.wmnet with OS trixie
[01:46:31] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1163
[01:47:14] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox
[01:52:47] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1163 - jasmine@cumin2002"
[01:52:53] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1163 - jasmine@cumin2002"
[01:52:53] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:52:53] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker1163.eqiad.wmnet 55.48.64.10.in-addr.arpa 5.5.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[01:52:57] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1163.eqiad.wmnet 55.48.64.10.in-addr.arpa 5.5.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[01:52:58] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1163
[01:53:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P94562 and previous config saved to /var/cache/conftool/dbconfig/20260630-015345-ladsgroup.json
[01:54:55] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1163
[01:54:55] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1163
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T0200)
[02:00:42] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:03:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410589)', diff saved to https://phabricator.wikimedia.org/P94563 and previous config saved to /var/cache/conftool/dbconfig/20260630-020353-ladsgroup.json
[02:03:58] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:04:09] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2194.codfw.wmnet with reason: Maintenance
[02:04:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P94564 and previous config saved to /var/cache/conftool/dbconfig/20260630-020416-ladsgroup.json
[02:07:28] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 45s)
[02:08:44] (PS1) Clare Ming: Remove webUIScroll config [mediawiki-config] - https://gerrit.wikimedia.org/r/1306454 (https://phabricator.wikimedia.org/T415370)
[02:09:33] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1163.eqiad.wmnet with reason: host reimage
[02:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P94565 and previous config saved to /var/cache/conftool/dbconfig/20260630-021149-ladsgroup.json
[02:11:55] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:14:24] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1163.eqiad.wmnet with reason: host reimage
[02:14:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P94566 and previous config saved to /var/cache/conftool/dbconfig/20260630-022157-ladsgroup.json
[02:32:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P94567 and previous config saved to /var/cache/conftool/dbconfig/20260630-023204-ladsgroup.json
[02:33:11] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1163.eqiad.wmnet with OS trixie
[02:34:34] !log homer lsw1-d3-eqiad* commit 'T430226'
[02:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:39] T430226: Automate workflow for vlan migrations on k8s worker nodes - https://phabricator.wikimedia.org/T430226
[02:42:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P94568 and previous config saved to /var/cache/conftool/dbconfig/20260630-024212-ladsgroup.json
[02:42:17] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:42:18] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance
[02:42:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P94569 and previous config saved to /var/cache/conftool/dbconfig/20260630-024225-ladsgroup.json
[02:42:51] jasmine@cumin2002 renumber-node (PID 3242723) is awaiting input
[02:49:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P94570 and previous config saved to /var/cache/conftool/dbconfig/20260630-024941-ladsgroup.json
[02:49:47] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:50:00] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2112.codfw.wmnet with OS trixie
[02:53:12] jasmine@cumin2002 renumber-node (PID 3242723) is awaiting input
[02:59:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P94571 and previous config saved to /var/cache/conftool/dbconfig/20260630-025948-ladsgroup.json
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T0300)
[03:01:50] (PS1) TrainBranchBot: testwikis to 1.47.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1306455 (https://phabricator.wikimedia.org/T423918)
[03:01:53] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306455 (https://phabricator.wikimedia.org/T423918) (owner: TrainBranchBot)
[03:02:48] (Merged) jenkins-bot: testwikis to 1.47.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1306455 (https://phabricator.wikimedia.org/T423918) (owner: TrainBranchBot)
[03:03:18] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.47.0-wmf.9 refs T423918
[03:03:23] T423918: 1.47.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T423918
[03:06:34] !log jasmine@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1163.eqiad.wmnet
[03:06:36] !log jasmine@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1163.eqiad.wmnet
[03:06:39] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker1163.eqiad.wmnet
[03:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:45] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2112.codfw.wmnet with reason: host reimage
[03:09:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P94572 and previous config saved to /var/cache/conftool/dbconfig/20260630-030956-ladsgroup.json
[03:13:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2112.codfw.wmnet with reason: host reimage
[03:20:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P94573 and previous config saved to /var/cache/conftool/dbconfig/20260630-032004-ladsgroup.json
[03:20:09] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[03:20:20] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2227.codfw.wmnet with reason: Maintenance
[03:20:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P94574 and previous config saved to /var/cache/conftool/dbconfig/20260630-032028-ladsgroup.json
[03:22:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[03:28:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P94575 and previous config saved to /var/cache/conftool/dbconfig/20260630-032802-ladsgroup.json
[03:28:08] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[03:32:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[03:32:39] (CR) Anzx: "for temporary variant wouldn't it be ideal to use variant https://gerrit.wikimedia.org/g/operations/mediawiki-config/%2B/refs/heads/master" [mediawiki-config] - https://gerrit.wikimedia.org/r/1306221 (https://phabricator.wikimedia.org/T430512) (owner: Mszwarc)
[03:34:33] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2112.codfw.wmnet with OS trixie
[03:38:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P94576 and previous config saved to /var/cache/conftool/dbconfig/20260630-033809-ladsgroup.json
[03:41:57] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.47.0-wmf.9 refs T423918 (duration: 38m 39s)
[03:42:01] T423918: 1.47.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T423918
[03:48:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P94577 and previous config saved to /var/cache/conftool/dbconfig/20260630-034818-ladsgroup.json
[03:58:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P94578 and previous config saved to /var/cache/conftool/dbconfig/20260630-035825-ladsgroup.json
[03:58:30] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[03:58:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2239.codfw.wmnet with reason: Maintenance
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T0400)
[04:02:44] !log mwpresync@deploy1003 Pruned MediaWiki: 1.47.0-wmf.6 (duration: 02m 36s)
[04:04:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:18:58] (CR) Hashar: "I have poked *SRE Infrastructure Foundation* team on their IRC channel (recommended by Filippo and Bryan yesterday)." [puppet] - https://gerrit.wikimedia.org/r/1306161 (https://phabricator.wikimedia.org/T430479) (owner: Hashar)
[04:33:09] (CR) Giuseppe Lavagetto: "While I dont' think this change is problematic, I don't think it has any specific effect in practice, as no request to the upload cluster " [puppet] - https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: BCornwall)
[04:48:50] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2114.codfw.wmnet with OS trixie
[05:02:07] (PS1) Arnaudb: gitlab: lower TTL for CNAMEs [dns] - https://gerrit.wikimedia.org/r/1306459 (https://phabricator.wikimedia.org/T425441)
[05:07:17] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2114.codfw.wmnet with reason: host reimage
[05:07:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:14:46] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2114.codfw.wmnet with reason: host reimage
[05:19:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1210 with weight 0 T430540', diff saved to https://phabricator.wikimedia.org/P94579 and previous config saved to /var/cache/conftool/dbconfig/20260630-051920-marostegui.json
[05:19:26] T430540: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T430540
[05:19:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T430540
[05:19:54] (CR) Marostegui: [C:+2] mariadb: Promote db1210 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1306317 (https://phabricator.wikimedia.org/T430540) (owner: Gerrit maintenance bot)
[05:24:24] !log Starting s5 eqiad failover from db1230 to db1210 - T430540
[05:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:28] T430540: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T430540
[05:24:42] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: OOB IPV6 down - https://phabricator.wikimedia.org/T430599#12068734 (ayounsi) FYI it works from my home internet: ` laptop:~$ ping -6 mr1-ulsfo.oob.wikimedia.org PING mr1-ulsfo.oob.wikimedia.org (2607:fb58:9000:7::2) 56 data by...
[05:25:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T430540', diff saved to https://phabricator.wikimedia.org/P94580 and previous config saved to /var/cache/conftool/dbconfig/20260630-052523-marostegui.json
[05:25:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1210 to s5 primary and set section read-write T430540', diff saved to https://phabricator.wikimedia.org/P94581 and previous config saved to /var/cache/conftool/dbconfig/20260630-052547-marostegui.json
[05:26:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1230 T430540', diff saved to https://phabricator.wikimedia.org/P94582 and previous config saved to /var/cache/conftool/dbconfig/20260630-052624-marostegui.json
[05:26:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1230: Repooling after switchover
[05:27:47] (CR) Marostegui: [C:+2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1306318 (https://phabricator.wikimedia.org/T430540) (owner: Gerrit maintenance bot)
[05:28:10] !log marostegui@dns1004 START - running authdns-update
[05:28:43] (CR) Dzahn: "can we really go this low nowadays? As far as I know so far 5M has been the lowest." [dns] - https://gerrit.wikimedia.org/r/1306459 (https://phabricator.wikimedia.org/T425441) (owner: Arnaudb)
[05:28:45] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1230: Repooling after switchover
[05:28:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1230: Repooling after switchover
[05:30:10] !log marostegui@dns1004 END - running authdns-update
[05:30:31] (CR) Arnaudb: "I _think_ we can, @ssingh@wikimedia.org suggested that to simplify the revert if we needed to get gitlab out of the CDN for any reason." [dns] - https://gerrit.wikimedia.org/r/1306459 (https://phabricator.wikimedia.org/T425441) (owner: Arnaudb)
[05:35:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2114.codfw.wmnet with OS trixie
[05:38:03] (PS11) Arnaudb: trafficserver: add a map for gitlab instances as a backend [puppet] - https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441)
[05:53:16] (PS1) Gerrit maintenance bot: mariadb: Promote db1189 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1306486 (https://phabricator.wikimedia.org/T430610)
[05:53:22] (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1306487 (https://phabricator.wikimedia.org/T430610)
[05:54:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T430610
[05:54:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1189 with weight 0 T430610', diff saved to https://phabricator.wikimedia.org/P94585 and previous config saved to /var/cache/conftool/dbconfig/20260630-055423-marostegui.json
[05:54:25] T430610: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T430610
[05:54:55] (CR) Marostegui: [C:+2] mariadb: Promote db1189 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1306486 (https://phabricator.wikimedia.org/T430610) (owner: Gerrit maintenance bot)
[05:55:32] !log Starting s3 eqiad failover from db1223 to db1189 - T430610
[05:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T430610', diff saved to https://phabricator.wikimedia.org/P94586 and previous config saved to /var/cache/conftool/dbconfig/20260630-055752-marostegui.json
[05:58:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1189 to s3 primary and set section read-write T430610', diff saved to https://phabricator.wikimedia.org/P94587 and previous config saved to /var/cache/conftool/dbconfig/20260630-055812-marostegui.json
[05:58:26] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2064.codfw.wmnet with OS trixie
[05:58:34] (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1306487 (https://phabricator.wikimedia.org/T430610) (owner: Gerrit maintenance bot)
[05:58:34] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2077.codfw.wmnet with OS trixie
[05:58:38] !log marostegui@dns1004 START - running authdns-update
[05:59:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1223 T430610', diff saved to https://phabricator.wikimedia.org/P94588 and previous config saved to /var/cache/conftool/dbconfig/20260630-055906-marostegui.json
[05:59:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1223: Repooling after switchover
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T0600)
[06:00:04] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T0600).
[06:00:09] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1223: Repooling after switchover
[06:00:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1223: Repooling after switchover
[06:00:36] !log marostegui@dns1004 END - running authdns-update
[06:14:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1230: Repooling after switchover
[06:15:51] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2077.codfw.wmnet with reason: host reimage
[06:16:07] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2064.codfw.wmnet with reason: host reimage
[06:22:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:22:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:22:49] !log marostegui@dns1004 START - running authdns-update
[06:22:59] !log marostegui@dns1004 START - running authdns-update
[06:23:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:23:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:23:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:23:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:23:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:23:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:25:52] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cirrussearch2077.codfw.wmnet with reason: host reimage
[06:26:43] !log marostegui@dns1004 END - running authdns-update
[06:28:03] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:28:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:28:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:28:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:33:57] PROBLEM - Elasticsearch HTTPS for production-search-codfw on cirrussearch2064 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[06:34:33] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on cirrussearch2077 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[06:35:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2064.codfw.wmnet with reason: host reimage
[06:35:33] RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on cirrussearch2077 is OK: SSL OK - Certificate cirrussearch2077.codfw.wmnet valid until 2026-07-28 06:29:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search
[06:35:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:38:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:38:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:38:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:38:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:38:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:38:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:40:56] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2077.codfw.wmnet with OS trixie [06:45:01] RECOVERY - Elasticsearch HTTPS for production-search-codfw on cirrussearch2064 is OK: SSL OK - Certificate cirrussearch2064.codfw.wmnet valid until 2026-07-28 06:39:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [06:45:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1223: Repooling after switchover [06:45:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:48:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:48:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:48:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:48:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:48:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:48:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [06:53:39] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2159 - https://phabricator.wikimedia.org/T430240#12068866 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Relevant dmesg entries: ` [Jun26 01:00] mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00) [ +0.000002]... [06:55:23] (03PS1) 10Muehlenhoff: Update account meta data for migurski [puppet] - 10https://gerrit.wikimedia.org/r/1306488 [06:56:41] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2064.codfw.wmnet with OS trixie [06:57:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2077-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [07:00:04] Amir1, urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T0700). [07:00:04] Msz2001: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:53] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:01:30] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12068875 (10Monrac5) Hello, thank you! I just signed [07:02:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:02:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:02:42] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Deploy wmflib 3.0.0 to production - https://phabricator.wikimedia.org/T430552#12068877 (10elukey) Changelog here: https://github.com/wikimedia/operations-software-pywmflib/blob/master/CHANGELOG.rst I am going to rollout the changes for Debian 11, then I'll d... 
[07:03:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:03] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:03] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone 
files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:03:07] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:05:04] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [07:05:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T426633)', diff saved to https://phabricator.wikimedia.org/P94596 and previous config saved to /var/cache/conftool/dbconfig/20260630-070512-fceratto.json [07:07:25] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Deploy wmflib 3.0.0 to production - https://phabricator.wikimedia.org/T430552#12068878 (10MoritzMuehlenhoff) Sounds good! [07:07:28] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2098.codfw.wmnet with OS trixie [07:07:40] !log installing libconfig-inifiles-perl security updates [07:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:12:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T426633)', diff saved to https://phabricator.wikimedia.org/P94597 and previous config saved to /var/cache/conftool/dbconfig/20260630-071206-fceratto.json [07:18:29] (03CR) 10Slyngshede: [C:03+1] Update account meta data for migurski [puppet] - 10https://gerrit.wikimedia.org/r/1306488 (owner: 10Muehlenhoff) [07:18:55] (03CR) 10Muehlenhoff: [C:03+2] Update account meta data for migurski [puppet] - 10https://gerrit.wikimedia.org/r/1306488 
(owner: 10Muehlenhoff) [07:21:39] !log upgrade all bullseye hosts to pywmflib 3.0 - T430552 [07:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:44] T430552: Deploy wmflib 3.0.0 to production - https://phabricator.wikimedia.org/T430552 [07:22:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P94598 and previous config saved to /var/cache/conftool/dbconfig/20260630-072214-fceratto.json [07:23:23] !log installing libgd-perl security updates [07:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:54] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2098.codfw.wmnet with reason: host reimage [07:27:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2077-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [07:32:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P94599 and previous config saved to /var/cache/conftool/dbconfig/20260630-073221-fceratto.json [07:34:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2098.codfw.wmnet with reason: host reimage [07:35:25] (03PS3) 10Mszwarc: Temporarily change plwiki tagline for 1.7M articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306221 (https://phabricator.wikimedia.org/T430512) [07:37:12] !log installing libhttp-daemon-perl security updates [07:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:09] (03CR) 
10Anzx: [C:03+1] Temporarily change plwiki tagline for 1.7M articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306221 (https://phabricator.wikimedia.org/T430512) (owner: 10Mszwarc) [07:40:23] !log installing libtext-csv-xs-perl security updates [07:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:25] (03PS1) 10Filippo Giunchedi: admin: add monathierse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1306495 (https://phabricator.wikimedia.org/T430304) [07:42:04] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12068953 (10fgiunchedi) Thank you all! @Monrac5 we'd need to verify your ssh public key out of band. please let me know when it would be a good time for... [07:42:04] (03CR) 10CI reject: [V:04-1] admin: add monathierse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1306495 (https://phabricator.wikimedia.org/T430304) (owner: 10Filippo Giunchedi) [07:42:28] (03CR) 10Mszwarc: "Moved to July 1st, the threshold hasn't been reached yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306221 (https://phabricator.wikimedia.org/T430512) (owner: 10Mszwarc) [07:42:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T426633)', diff saved to https://phabricator.wikimedia.org/P94600 and previous config saved to /var/cache/conftool/dbconfig/20260630-074229-fceratto.json [07:43:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12068956 (10fgiunchedi) @Milimetric @Ahoelzl @Ottomata I'm seeking `analytics-privatedata-users` approval for Mona, an former WMDE... 
[07:44:25] (03PS2) 10Filippo Giunchedi: admin: add monathierse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1306495 (https://phabricator.wikimedia.org/T430304) [07:45:45] !log installing nodejs security updates [07:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:47] (03PS7) 10Daniel Kinzler: rest-gateway: emit 401 if rate limit is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) [07:46:48] (03PS2) 10Daniel Kinzler: rest-gateway: put request ID into rate limit respose [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300775 [07:47:20] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v3.1.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1306496 [07:52:08] (03PS3) 10Daniel Kinzler: rest-gateway: put request ID into rate limit respose [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300775 [07:53:07] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2099.codfw.wmnet with OS trixie [07:54:29] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v3.1.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1306496 (owner: 10Elukey) [07:57:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2098.codfw.wmnet with OS trixie [07:57:24] (03PS1) 10Elukey: Upstream release v3.1.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1306499 [07:57:43] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v3.1.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1306499 (owner: 10Elukey) [07:59:34] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12069051 (10fgiunchedi) [08:00:05] andre and brennen: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T0800) [08:00:26] jouncebot: Thanks, but train is currently blocked on broken icons. [08:01:52] (03PS1) 10Filippo Giunchedi: admin: add rscout to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1306500 (https://phabricator.wikimedia.org/T430594) [08:03:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for rscout - https://phabricator.wikimedia.org/T430594#12069073 (10fgiunchedi) @Rscout we need to verify your ssh key out of band, please let me know when it would be a good time for a quick google meet. feel also free to send... [08:04:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:05:00] (03PS1) 10Giuseppe Lavagetto: hiddenparma: add default ratelimits file [puppet] - 10https://gerrit.wikimedia.org/r/1306501 (https://phabricator.wikimedia.org/T422249) [08:05:03] (03PS1) 10Giuseppe Lavagetto: cache::varnish: add rate-limit file generated from hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1306502 (https://phabricator.wikimedia.org/T422249) [08:05:06] (03PS1) 10Giuseppe Lavagetto: cache::varnish: switch known client rate limits to hp-generated data [puppet] - 10https://gerrit.wikimedia.org/r/1306503 (https://phabricator.wikimedia.org/T422249) [08:07:50] !log uploaded python3-wmflib_3.1.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia,trixie-wikimedia [08:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:10] (03CR) 10CI reject: [V:04-1] cache::varnish: add rate-limit file generated from hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1306502 
(https://phabricator.wikimedia.org/T422249) (owner: 10Giuseppe Lavagetto) [08:08:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [08:08:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1223 (T426633)', diff saved to https://phabricator.wikimedia.org/P94601 and previous config saved to /var/cache/conftool/dbconfig/20260630-080858-fceratto.json [08:12:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2099.codfw.wmnet with reason: host reimage [08:14:36] (03PS2) 10Btullis: datahub: pin the production release to chart 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306440 (https://phabricator.wikimedia.org/T402408) [08:14:36] (03PS2) 10Btullis: datahub: upgrade chart to DataHub 1.6.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306441 (https://phabricator.wikimedia.org/T402408) [08:14:36] (03PS2) 10Btullis: datahub-next: use DataHub 1.6.0 images for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306442 (https://phabricator.wikimedia.org/T402408) [08:14:36] (03PS1) 10Btullis: datahub: bump umbrella chart to 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306504 (https://phabricator.wikimedia.org/T402408) [08:16:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T426633)', diff saved to https://phabricator.wikimedia.org/P94602 and previous config saved to /var/cache/conftool/dbconfig/20260630-081609-fceratto.json [08:17:07] (03CR) 10CI reject: [V:04-1] datahub: pin the production release to chart 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306440 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:17:15] (03CR) 10CI reject: [V:04-1] datahub-next: use DataHub 1.6.0 images for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306442 (https://phabricator.wikimedia.org/T402408) 
(owner: 10Btullis) [08:17:19] (03CR) 10CI reject: [V:04-1] datahub: upgrade chart to DataHub 1.6.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306441 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:19:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2099.codfw.wmnet with reason: host reimage [08:19:37] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1306500 (https://phabricator.wikimedia.org/T430594) (owner: 10Filippo Giunchedi) [08:21:36] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Fix Pypi twine setup for pywmflib - https://phabricator.wikimedia.org/T430620 (10elukey) 03NEW [08:24:10] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Fix Pypi twine setup for pywmflib - https://phabricator.wikimedia.org/T430620#12069181 (10Volans) The build on `build2004` is only for the debian package, the PyPi release is all based on the local artifacts created locally. The reported version `3.1.1.dev0+g... 
[08:24:52] (03PS1) 10Muehlenhoff: Failover url-downloader.eqiad CNAME to one of the new Trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1306530 (https://phabricator.wikimedia.org/T427282) [08:26:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P94603 and previous config saved to /var/cache/conftool/dbconfig/20260630-082616-fceratto.json [08:26:34] (03PS1) 10Elukey: profile::cache::haproxy: change webrequest top 10k IPs map name [puppet] - 10https://gerrit.wikimedia.org/r/1306545 (https://phabricator.wikimedia.org/T402512) [08:27:13] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306545 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [08:27:17] (03PS1) 10Ayounsi: depool-rack: run the k8s cookbook with relevant alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1306551 (https://phabricator.wikimedia.org/T327300) [08:29:36] (03PS2) 10Blake: kube-state-metrics: Add v2.18.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305377 (https://phabricator.wikimedia.org/T427405) [08:29:58] (03CR) 10Blake: kube-state-metrics: Add v2.18.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305377 (https://phabricator.wikimedia.org/T427405) (owner: 10Blake) [08:30:35] (03PS1) 10Isabelle Hurbain-Palatin: Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306575 (https://phabricator.wikimedia.org/T430194) [08:30:51] ihurbain: ^^^!!!! 
🎉 [08:30:51] (03CR) 10CI reject: [V:04-1] depool-rack: run the k8s cookbook with relevant alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1306551 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [08:30:57] moaar parsoid [08:31:00] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Fix Pypi twine setup for pywmflib - https://phabricator.wikimedia.org/T430620#12069199 (10elukey) >>! In T430620#12069181, @Volans wrote: > The build on `build2004` is only for the debian package, the PyPi release is all based on the local artifacts created l... [08:31:04] (03PS2) 10Isabelle Hurbain-Palatin: Turn on Parsoid Read views for 25% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306575 (https://phabricator.wikimedia.org/T430194) [08:31:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306575 (https://phabricator.wikimedia.org/T430194) (owner: 10Isabelle Hurbain-Palatin) [08:32:00] (03CR) 10Filippo Giunchedi: [C:03+2] wikimedia.org: add dumps-nfs [dns] - 10https://gerrit.wikimedia.org/r/1305406 (https://phabricator.wikimedia.org/T411248) (owner: 10Filippo Giunchedi) [08:32:10] !log filippo@dns1004 START - running authdns-update [08:32:16] hashar: it's even better than that, we have 5% since yesterday, and we're turning to 25% this afternoon - I typo'd the commit message :P [08:32:27] :-] [08:32:30] (03CR) 10Atsuko: [C:03+1] datahub: bump umbrella chart to 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306504 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:32:39] (03CR) 10Atsuko: [C:03+1] datahub: pin the production release to chart 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306440 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:32:42] (well, i copy-pasted and forgot 
to adjust, more precisely.) [08:33:45] (03CR) 10Filippo Giunchedi: [C:03+2] conftool-data: add dumps-nfs [puppet] - 10https://gerrit.wikimedia.org/r/1305402 (https://phabricator.wikimedia.org/T411248) (owner: 10Filippo Giunchedi) [08:34:10] !log filippo@dns1004 END - running authdns-update [08:34:24] (03CR) 10Filippo Giunchedi: [C:03+2] dumps: open nfs port to lb healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1305403 (https://phabricator.wikimedia.org/T411248) (owner: 10Filippo Giunchedi) [08:35:02] (03CR) 10Filippo Giunchedi: [C:03+2] dumps: add dumps-nfs service pool [puppet] - 10https://gerrit.wikimedia.org/r/1305405 (https://phabricator.wikimedia.org/T411248) (owner: 10Filippo Giunchedi) [08:35:14] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: add dumps-nfs service in service_setup state [puppet] - 10https://gerrit.wikimedia.org/r/1305404 (https://phabricator.wikimedia.org/T411248) (owner: 10Filippo Giunchedi) [08:35:14] (03CR) 10Atsuko: [C:03+1] datahub-next: use DataHub 1.6.0 images for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306442 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:35:15] (03CR) 10Muehlenhoff: [C:03+2] Failover url-downloader.eqiad CNAME to one of the new Trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1306530 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [08:35:32] !log jmm@dns1004 START - running authdns-update [08:35:56] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1306614 (https://phabricator.wikimedia.org/T430624) [08:36:16] (03CR) 10Atsuko: [C:03+1] datahub: upgrade chart to DataHub 1.6.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306441 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [08:36:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P94604 and previous config saved to 
/var/cache/conftool/dbconfig/20260630-083624-fceratto.json [08:37:32] !log jmm@dns1004 END - running authdns-update [08:39:32] (03PS2) 10Ayounsi: depool-rack: run the k8s cookbook with relevant alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1306551 (https://phabricator.wikimedia.org/T327300) [08:39:35] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2099.codfw.wmnet with OS trixie [08:40:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T430624 [08:40:13] T430624: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T430624 [08:41:39] (03CR) 10Hashar: [C:03+1] "I am not sure what is the rationale compared to updating the existing `jenkins_proxy` file to switch to a new scheme/host for the backend." [puppet] - 10https://gerrit.wikimedia.org/r/1306445 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [08:42:15] (03PS15) 10Arnaudb: trafficserver: add a map for gitlab instances as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [08:42:15] (03CR) 10Arnaudb: "per our chat yesterday with @ssingh@wikimedia.org I modified taskgen.rb to let it render templates and use references in Gitlab's configur" [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:44:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2207 with weight 0 T430624', diff saved to https://phabricator.wikimedia.org/P94605 and previous config saved to /var/cache/conftool/dbconfig/20260630-084436-fceratto.json [08:46:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T426633)', diff saved to https://phabricator.wikimedia.org/P94606 and previous config saved to /var/cache/conftool/dbconfig/20260630-084632-fceratto.json [08:46:48] (03CR) 10JMeybohm: [C:03+1] 
kube-state-metrics: Add v2.18.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305377 (https://phabricator.wikimedia.org/T427405) (owner: 10Blake) [08:49:04] (03PS1) 10Muehlenhoff: Failover url-downloader.codfw CNAME to one of the new Trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1306622 (https://phabricator.wikimedia.org/T427282) [08:50:07] (03PS1) 10Jelto: Update calico to v3.30.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306307 (https://phabricator.wikimedia.org/T427400) [08:50:07] (03CR) 10Jelto: "I created the upstream diff with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306307 (https://phabricator.wikimedia.org/T427400) (owner: 10Jelto) [08:50:24] (03PS2) 10Jelto: Update calico to v3.30.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306307 (https://phabricator.wikimedia.org/T427400) [08:51:23] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1306614 (https://phabricator.wikimedia.org/T430624) (owner: 10Gerrit maintenance bot) [08:52:23] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:55:07] (03CR) 10Muehlenhoff: [C:03+2] Failover url-downloader.codfw CNAME to one of the new Trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1306622 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [08:55:12] !log jmm@dns1004 START - running authdns-update [08:57:13] !log jmm@dns1004 END - running authdns-update [08:57:23] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - 
https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:58:48] (03PS3) 10Blake: kube-state-metrics: Add v2.18.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305377 (https://phabricator.wikimedia.org/T427405) [09:01:11] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1027.eqiad.wmnet,service=s2 [09:01:16] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1027.eqiad.wmnet,service=s7 [09:02:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1027].eqiad.wmnet with reason: cloning [09:02:38] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s7 [09:02:43] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s2 [09:03:46] (03CR) 10Btullis: [C:03+2] datahub: bump umbrella chart to 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306504 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:03:53] (03CR) 10Ozge: [C:03+1] "left a small comment about the exiting qwen3-14b." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1306286 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [09:04:38] !log Starting s2 codfw failover from db2204 to db2207 - T430624 [09:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:42] T430624: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T430624 [09:05:25] (03Abandoned) 10Cathal Mooney: Apply regular peering preference to primary IXP if AS-Path >= 3 hops [homer/public] - 10https://gerrit.wikimedia.org/r/1306369 (owner: 10Cathal Mooney) [09:05:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2207 to s2 primary T430624', diff saved to https://phabricator.wikimedia.org/P94609 and previous config saved to /var/cache/conftool/dbconfig/20260630-090530-fceratto.json [09:05:57] (03CR) 10Muehlenhoff: [C:03+2] package_builder: Also specify apt key for three other source sources [puppet] - 10https://gerrit.wikimedia.org/r/1306271 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [09:06:11] (03Merged) 10jenkins-bot: datahub: bump umbrella chart to 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306504 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:07:53] (03PS1) 10Aklapper: Fix overflow menu for non-advanced users [skins/MinervaNeue] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306629 (https://phabricator.wikimedia.org/T428220) [09:08:41] FIRING: ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_dumps-nfs.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:08:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set weight db2204 T430624', diff saved to https://phabricator.wikimedia.org/P94610 and previous config saved to /var/cache/conftool/dbconfig/20260630-090841-fceratto.json [09:09:04] (03CR) 
10Aklapper: [V:03+2 C:03+2] "Cherry-picking this train blocker, per conversation in Slack" [skins/MinervaNeue] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306629 (https://phabricator.wikimedia.org/T428220) (owner: 10Aklapper) [09:11:35] I am going to do a backport and then run the usual train deployment [09:11:44] (03CR) 10Btullis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306440 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:12:10] (03PS1) 10AikoChou: changeprop: add liftwing revertrisk-wikidata stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306630 (https://phabricator.wikimedia.org/T420883) [09:12:19] !log aklapper@deploy1003 Started scap sync-world: Backport for [[gerrit:1306629|Fix overflow menu for non-advanced users (T428220)]] [09:12:25] T428220: Scale Mobile Account Menu to All WMF Wikis - https://phabricator.wikimedia.org/T428220 [09:13:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [09:13:06] (03CR) 10Jforrester: [C:03+1] Turn on Parsoid Read views for 25% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306575 (https://phabricator.wikimedia.org/T430194) (owner: 10Isabelle Hurbain-Palatin) [09:13:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2204 (T426633)', diff saved to https://phabricator.wikimedia.org/P94611 and previous config saved to /var/cache/conftool/dbconfig/20260630-091307-fceratto.json [09:13:41] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_dumps-nfs.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:16:27] !log aklapper@deploy1003 aklapper: Backport for [[gerrit:1306629|Fix overflow menu for non-advanced users (T428220)]] synced 
to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:17:39] !log aklapper@deploy1003 aklapper: Continuing with deployment [09:17:49] (03PS3) 10Jelto: Update calico to v3.30.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306307 (https://phabricator.wikimedia.org/T427400) [09:19:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T426633)', diff saved to https://phabricator.wikimedia.org/P94612 and previous config saved to /var/cache/conftool/dbconfig/20260630-091915-fceratto.json [09:19:20] !log filippo@puppetserver1001 conftool action : set/pooled=yes:weight=100; selector: service=dumps-nfs [09:20:40] (03PS1) 10Filippo Giunchedi: hieradata: set dumps-nfs in state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1306631 (https://phabricator.wikimedia.org/T411248) [09:21:44] (03PS4) 10Daniel Kinzler: rest-gateway: put request ID into rate limit respose [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300775 [09:23:33] (03PS2) 10Kamila Součková: aux/zarcillo: don't hardcode helmBinary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304588 (https://phabricator.wikimedia.org/T388390) [09:23:41] RESOLVED: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_dumps-nfs.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:23:59] !log aklapper@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306629|Fix overflow menu for non-advanced users (T428220)]] (duration: 11m 40s) [09:24:05] T428220: Scale Mobile Account Menu to All WMF Wikis - https://phabricator.wikimedia.org/T428220 [09:24:47] 06SRE, 06Data-Persistence, 10Kafka-Infrastructure: Update roll-restart-reboot-brokers.py to display broker id and FQDN of the broker - https://phabricator.wikimedia.org/T425747#12069501 (10elukey) [09:25:16] 
06SRE, 06Infrastructure-Foundations, 10Kafka-Infrastructure, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade Kafka to version 3.x - https://phabricator.wikimedia.org/T416669#12069503 (10elukey) [09:25:46] 06SRE, 10Kafka-Infrastructure: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#12069504 (10elukey) [09:26:33] (03PS1) 10Ayounsi: rack depool: use build in reason fuction [cookbooks] - 10https://gerrit.wikimedia.org/r/1306637 [09:27:58] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 10Kafka-Infrastructure, and 2 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#12069508 (10elukey) 05Stalled→03In progress p:05Low→03Medium a:03RKemper [09:28:31] 10SRE-SLO, 10Citoid, 06Editing-team, 07Sustainability (Incident Followup): Improve monitoring in citoid so that url-downloader failures are detected - https://phabricator.wikimedia.org/T381372#12069518 (10Mvolz) [09:29:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P94613 and previous config saved to /var/cache/conftool/dbconfig/20260630-092923-fceratto.json [09:30:39] (03PS1) 10TrainBranchBot: group0 to 1.47.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306638 (https://phabricator.wikimedia.org/T423918) [09:30:42] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306638 (https://phabricator.wikimedia.org/T423918) (owner: 10TrainBranchBot) [09:31:19] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on cirrussearch2087 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [09:31:19] PROBLEM - Elasticsearch HTTPS for production-search-codfw on cirrussearch2087 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search 
[09:31:41] (03Merged) 10jenkins-bot: group0 to 1.47.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306638 (https://phabricator.wikimedia.org/T423918) (owner: 10TrainBranchBot) [09:32:19] RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on cirrussearch2087 is OK: SSL OK - Certificate cirrussearch2087.codfw.wmnet valid until 2026-07-28 09:26:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [09:32:19] RECOVERY - Elasticsearch HTTPS for production-search-codfw on cirrussearch2087 is OK: SSL OK - Certificate cirrussearch2087.codfw.wmnet valid until 2026-07-28 09:26:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [09:34:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:34:23] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [09:34:30] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [09:34:43] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1261: Upgrading db1261.eqiad.wmnet [09:35:14] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1261: Upgrading db1261.eqiad.wmnet [09:38:05] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.9 refs T423918 [09:38:08] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306630 (https://phabricator.wikimedia.org/T420883) (owner: 10AikoChou) [09:38:09] T423918: 1.47.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T423918 [09:38:14] cwilliams@cumin1003 major-upgrade (PID 65730) is awaiting input [09:38:18] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#12069589 (10MoritzMuehlenhoff) All traffic now goes via the new nodes. 
Tomorrow I'll switch the old proxies into the insetup role and stop Squid (and keep them around for a grace period) [09:39:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P94615 and previous config saved to /var/cache/conftool/dbconfig/20260630-093931-fceratto.json [09:39:53] (03CR) 10Muehlenhoff: [C:03+2] build2004: Enable profile::docker::builder::docker_pkg [puppet] - 10https://gerrit.wikimedia.org/r/1306245 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [09:40:43] (03CR) 10AikoChou: [C:03+2] changeprop: add liftwing revertrisk-wikidata stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306630 (https://phabricator.wikimedia.org/T420883) (owner: 10AikoChou) [09:41:39] (03PS1) 10Marostegui: mariadb: Productionize clouddb1027 [puppet] - 10https://gerrit.wikimedia.org/r/1306641 (https://phabricator.wikimedia.org/T409557) [09:42:28] (03PS2) 10Marostegui: mariadb: Productionize clouddb1027 [puppet] - 10https://gerrit.wikimedia.org/r/1306641 (https://phabricator.wikimedia.org/T409557) [09:42:53] (03Merged) 10jenkins-bot: changeprop: add liftwing revertrisk-wikidata stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306630 (https://phabricator.wikimedia.org/T420883) (owner: 10AikoChou) [09:44:25] cwilliams@cumin1003 major-upgrade (PID 65730) is awaiting input [09:45:09] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1306631 (https://phabricator.wikimedia.org/T411248) (owner: 10Filippo Giunchedi) [09:45:16] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize clouddb1027 [puppet] - 10https://gerrit.wikimedia.org/r/1306641 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [09:45:28] (03CR) 10Santiago Faci: [C:03+2] growthbook: Updated chart to add API_RATE_LIMIT_MAX env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305785 (https://phabricator.wikimedia.org/T429420) (owner: 10Santiago Faci) [09:45:32] (03CR) 10Btullis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306441 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:45:45] (03CR) 10Btullis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306442 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:46:32] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: set dumps-nfs in state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1306631 (https://phabricator.wikimedia.org/T411248) (owner: 10Filippo Giunchedi) [09:48:10] (03CR) 10Btullis: [C:03+2] datahub: pin the production release to chart 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306440 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:48:17] (03CR) 10Btullis: [C:03+2] datahub: upgrade chart to DataHub 1.6.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306441 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:48:19] (03Merged) 10jenkins-bot: growthbook: Updated chart to add API_RATE_LIMIT_MAX env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305785 (https://phabricator.wikimedia.org/T429420) (owner: 10Santiago Faci) [09:49:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T426633)', diff saved to https://phabricator.wikimedia.org/P94616 and previous config saved to /var/cache/conftool/dbconfig/20260630-094938-fceratto.json 
[09:50:22] (03Merged) 10jenkins-bot: datahub: pin the production release to chart 0.0.82 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306440 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:50:25] (03Merged) 10jenkins-bot: datahub: upgrade chart to DataHub 1.6.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306441 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:51:12] !log aikochou@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: sync [09:51:20] !log aikochou@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: sync [09:51:53] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:53:21] that's me ^ [09:53:30] !log restart pybal on A:lvs-secondary-eqiad [09:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:33] (03CR) 10Btullis: [C:03+2] datahub-next: use DataHub 1.6.0 images for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306442 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:53:51] (03CR) 10Santiago Faci: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [09:55:17] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 22 connections established with conf1007.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [09:55:29] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - dumps-lb6_2049: Servers clouddumps1001.wikimedia.org are marked down but pooled: dumps-lb_2049: Servers clouddumps1001.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:55:43] (03Merged) 10jenkins-bot: datahub-next: use DataHub 1.6.0 images for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306442 
(https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [09:55:55] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:56:11] !log restart pybal on A:lvs-high-traffic2-eqiad [09:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:53] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:57:14] standing by for the recoveries [09:57:26] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: put request ID into rate limit respose [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300775 (owner: 10Daniel Kinzler) [09:57:31] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: emit 401 if rate limit is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) (owner: 10Daniel Kinzler) [09:58:56] (03PS1) 10Clément Goubert: trafficserver::backend: Remove X-W-D for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1306643 (https://phabricator.wikimedia.org/T428909) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1000) [10:00:10] (03Merged) 10jenkins-bot: rest-gateway: emit 401 if rate limit is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) (owner: 10Daniel Kinzler) [10:00:13] (03Merged) 10jenkins-bot: rest-gateway: put request ID into rate limit respose [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300775 (owner: 10Daniel Kinzler) [10:00:13] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1261.eqiad.wmnet with OS trixie [10:00:17] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 24 connections established with conf1007.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [10:00:55] RECOVERY - PyBal 
IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:01:00] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12069743 (10MoritzMuehlenhoff) [10:01:19] (03PS1) 10Slyngshede: Permissions: Create log entry on auto-expire [software/bitu] - 10https://gerrit.wikimedia.org/r/1306645 (https://phabricator.wikimedia.org/T418843) [10:02:17] (03PS1) 10Jforrester: Add abstractwiki_fetch_section_token to POST requests [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306646 [10:04:48] (03CR) 10Federico Ceratto: [C:03+2] aux/zarcillo: don't hardcode helmBinary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304588 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [10:04:58] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304588 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [10:07:35] (03PS1) 10JavierMonton: stream: webrequest.page_view.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306647 (https://phabricator.wikimedia.org/T426091) [10:07:35] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:08:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2087-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [10:12:38] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:15:28] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1261.eqiad.wmnet with reason: host reimage 
[10:16:00] (03CR) 10DCausse: [C:03+1] stream: webrequest.page_view.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306647 (https://phabricator.wikimedia.org/T426091) (owner: 10JavierMonton) [10:16:34] (03PS4) 10Hashar: ci: add repositories to gitcache [puppet] - 10https://gerrit.wikimedia.org/r/1302834 (https://phabricator.wikimedia.org/T430627) (owner: 10Arnaudb) [10:18:21] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:18:45] (03CR) 10Elukey: "Left a style comment but the logic looks good, lemme know!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1306551 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [10:19:44] (03CR) 10Hashar: [C:03+1] "I have finally done the benchmark. It almost twice faster. I have amended the commit message with the command I have used and with the res" [puppet] - 10https://gerrit.wikimedia.org/r/1302834 (https://phabricator.wikimedia.org/T430627) (owner: 10Arnaudb) [10:19:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1261.eqiad.wmnet with reason: host reimage [10:20:06] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:21:05] (03PS1) 10Revi: CommonSettings: add Ombuds to wgWMCGlobalGroupToRateLimitClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306649 (https://phabricator.wikimedia.org/T430641) [10:21:08] (03PS1) 10Mszwarc: SuggestedInvestigations: Defer signal matching until transaction commits [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306650 (https://phabricator.wikimedia.org/T430617) [10:21:33] (03PS1) 10Mszwarc: SuggestedInvestigations: Defer signal matching until transaction commits [extensions/CheckUser] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306651 (https://phabricator.wikimedia.org/T430617) [10:21:48] (03CR) 10Revi: [C:04-1] "Pending WMF T&S approval." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306649 (https://phabricator.wikimedia.org/T430641) (owner: 10Revi) [10:24:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306647 (https://phabricator.wikimedia.org/T426091) (owner: 10JavierMonton) [10:25:01] (03Merged) 10jenkins-bot: stream: webrequest.page_view.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306647 (https://phabricator.wikimedia.org/T426091) (owner: 10JavierMonton) [10:25:19] (03CR) 10Elukey: rack depool: use build in reason fuction (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1306637 (owner: 10Ayounsi) [10:25:25] !log javiermonton@deploy1003 Started scap sync-world: Backport for [[gerrit:1306647|stream: webrequest.page_view.dev0 (T426091)]] [10:25:29] T426091: Schema and Stream for "webrequest.page_view" - https://phabricator.wikimedia.org/T426091 [10:25:36] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:25:47] (03CR) 10Hnowlan: [C:03+2] restbase: remove icinga disk space check, use alertmanager check [puppet] - 10https://gerrit.wikimedia.org/r/1304853 (https://phabricator.wikimedia.org/T407141) (owner: 10Hnowlan) [10:25:57] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:26:01] (03PS1) 10Jforrester: abstractwiki: Show Abstract provenance notice to all readers, not just sysops [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306652 (https://phabricator.wikimedia.org/T422710) [10:26:37] (03PS2) 10Hnowlan: restbase: remove icinga disk space check, use alertmanager check [puppet] - 10https://gerrit.wikimedia.org/r/1304853 (https://phabricator.wikimedia.org/T407141) [10:27:30] !log javiermonton@deploy1003 javiermonton: Backport for [[gerrit:1306647|stream: webrequest.page_view.dev0 (T426091)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:27:32] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:28:59] !log javiermonton@deploy1003 javiermonton: Continuing with deployment [10:30:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply [10:32:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: apply} [10:33:21] !log javiermonton@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306647|stream: webrequest.page_view.dev0 (T426091)]] (duration: 07m 56s) [10:33:26] T426091: Schema and Stream for "webrequest.page_view" - https://phabricator.wikimedia.org/T426091 [10:35:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1261.eqiad.wmnet with OS trixie [10:35:59] (03CR) 10Hnowlan: [C:03+2] restbase: remove icinga disk space check, use alertmanager check [puppet] - 10https://gerrit.wikimedia.org/r/1304853 (https://phabricator.wikimedia.org/T407141) (owner: 10Hnowlan) [10:38:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: cloning [10:39:30] (03PS1) 10Marostegui: mariadb: Move db1228 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/1306654 (https://phabricator.wikimedia.org/T430111) [10:40:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [10:40:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: cloning [10:41:11] (03PS1) 10Santiago Faci: growthbook: Bumped chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306655 [10:41:13] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - 
https://phabricator.wikimedia.org/T426759#12069917 (10MoritzMuehlenhoff) [10:41:58] (03CR) 10Btullis: [C:03+1] growthbook: Bumped chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306655 (owner: 10Santiago Faci) [10:42:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1306645 (https://phabricator.wikimedia.org/T418843) (owner: 10Slyngshede) [10:42:42] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:42:56] ^ expected [10:43:00] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:43:18] (03PS1) 10Ozge: editing-suggestions: bump model to 20260630103440 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306656 [10:43:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbproxy[1026,1028].eqiad.wmnet with reason: cloning [10:44:15] (03CR) 10Santiago Faci: [C:03+2] growthbook: Bumped chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306655 (owner: 10Santiago Faci) [10:44:16] (03PS1) 10Blake: Test change [puppet] - 10https://gerrit.wikimedia.org/r/1306657 [10:44:18] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1228 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/1306654 (https://phabricator.wikimedia.org/T430111) (owner: 10Marostegui) [10:44:24] (03CR) 10Hnowlan: [C:03+1] trafficserver::backend: Remove X-W-D for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1306643 (https://phabricator.wikimedia.org/T428909) (owner: 10Clément Goubert) [10:44:41] (03Abandoned) 10Blake: Test change [puppet] - 10https://gerrit.wikimedia.org/r/1306657 (owner: 10Blake) [10:45:20] (03CR) 10Blake: [C:03+2] kube-state-metrics: Add v2.18.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305377 
(https://phabricator.wikimedia.org/T427405) (owner: 10Blake) [10:45:51] (03PS2) 10Ozge: editing-suggestions: bump model to 20260630103440 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306656 (https://phabricator.wikimedia.org/T428882) [10:46:14] (03CR) 10Blake: [V:03+2 C:03+2] kube-state-metrics: Add v2.18.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1305377 (https://phabricator.wikimedia.org/T427405) (owner: 10Blake) [10:46:32] (03Merged) 10jenkins-bot: growthbook: Bumped chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306655 (owner: 10Santiago Faci) [10:47:26] (03PS3) 10Ozge: ml-services: editing-suggestions model bump to 20260630103440 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306656 (https://phabricator.wikimedia.org/T428882) [10:47:43] (03PS4) 10Ozge: ml-services: editing-suggestions model bump to 20260630103440 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306656 (https://phabricator.wikimedia.org/T428882) [10:48:10] (03PS5) 10Arnaudb: ci: add repositories to gitcache [puppet] - 10https://gerrit.wikimedia.org/r/1302834 (https://phabricator.wikimedia.org/T430627) [10:48:37] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1306398 (https://phabricator.wikimedia.org/T429578) (owner: 10Andrew Bogott) [10:48:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2087-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [10:49:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - dumps-lb6_2049: Servers clouddumps1001.wikimedia.org are marked down but pooled: dumps-lb_2049: Servers 
clouddumps1001.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:49:33] (03CR) 10Clément Goubert: [C:03+2] trafficserver::backend: Remove X-W-D for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1306643 (https://phabricator.wikimedia.org/T428909) (owner: 10Clément Goubert) [10:50:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [10:52:45] (03PS1) 10Mszwarc: SuggestedInvestigations: Link to Interaction Timeline for shared pages [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306658 (https://phabricator.wikimedia.org/T429785) [10:53:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:54:49] (03CR) 10Samtar: [C:03+1] "makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/1305970 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [10:55:13] (03CR) 10Arnaudb: "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1302834 (https://phabricator.wikimedia.org/T430627) (owner: 10Arnaudb) [10:55:23] (03CR) 10Arnaudb: [C:03+2] ci: add repositories to gitcache [puppet] - 10https://gerrit.wikimedia.org/r/1302834 (https://phabricator.wikimedia.org/T430627) (owner: 10Arnaudb) [10:55:38] ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - dumps-lb6_2049: Servers clouddumps1001.wikimedia.org are marked down but pooled: dumps-lb_2049: Servers clouddumps1001.wikimedia.org are marked down but pooled Filippo Giunchedi known https://wikitech.wikimedia.org/wiki/PyBal [10:55:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:56:04] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:58:39] (03CR) 10JavierMonton: [C:03+2] stream: pageview-trending-relative-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) (owner: 10JavierMonton) [11:00:06] (03PS1) 10Hnowlan: restbase: move nrpe check to prom blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1306661 (https://phabricator.wikimedia.org/T407141) [11:00:48] (03Merged) 10jenkins-bot: stream: pageview-trending-relative-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306263 (https://phabricator.wikimedia.org/T430134) (owner: 10JavierMonton) [11:01:30] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8813/console" [puppet] - 10https://gerrit.wikimedia.org/r/1306661 (https://phabricator.wikimedia.org/T407141) (owner: 10Hnowlan) [11:01:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:03:23] (03CR) 10Hnowlan: [V:03+1] restbase: move nrpe check to prom blackbox check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1306661 (https://phabricator.wikimedia.org/T407141) (owner: 10Hnowlan) [11:03:28] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/pageview-trending-relative-next: apply [11:03:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/pageview-trending-relative-next: apply [11:06:10] 06SRE, 10SRE-Access-Requests, 
06Data-Engineering, 13Patch-For-Review: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12070030 (10Milimetric) Approved (NOTE: I initially thought there was no NDA on file, but see from the comments it's now available,... [11:06:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1306495 (https://phabricator.wikimedia.org/T430304) (owner: 10Filippo Giunchedi) [11:07:17] (03CR) 10Kamila Součková: [C:03+2] aux/zarcillo: don't hardcode helmBinary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304588 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [11:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:55] o/ would like to do a private code deploy in the open time if nothing else is happening [11:09:51] (03Merged) 10jenkins-bot: aux/zarcillo: don't hardcode helmBinary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304588 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [11:11:04] (03PS1) 10Marostegui: check_private_data_report: Add clouddb1027 [puppet] - 10https://gerrit.wikimedia.org/r/1306663 (https://phabricator.wikimedia.org/T409557) [11:11:12] (03CR) 10Slyngshede: [C:03+2] Permissions: Create log entry on auto-expire [software/bitu] - 10https://gerrit.wikimedia.org/r/1306645 (https://phabricator.wikimedia.org/T418843) (owner: 10Slyngshede) [11:12:01] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Add clouddb1027 [puppet] - 10https://gerrit.wikimedia.org/r/1306663 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [11:12:25] starting [11:13:39] !log installing Linux 6.12.94 on Trixie hosts [11:13:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:13] (03Merged) 10jenkins-bot: Permissions: Create log entry on auto-expire [software/bitu] - 10https://gerrit.wikimedia.org/r/1306645 (https://phabricator.wikimedia.org/T418843) (owner: 10Slyngshede) [11:15:52] Tran: could you ping me when you finish? I'd like to do a deployment as well (public code this time) [11:17:17] np [11:19:45] RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:20:01] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:21:50] !log Deployed patch for T427287 [11:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:57] Msz2001: done [11:22:05] Thanks [11:22:23] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:22:38] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:22:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306650 (https://phabricator.wikimedia.org/T430617) (owner: 10Mszwarc) [11:22:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306651 
(https://phabricator.wikimedia.org/T430617) (owner: 10Mszwarc) [11:22:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306658 (https://phabricator.wikimedia.org/T429785) (owner: 10Mszwarc) [11:23:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1261: Migration of db1261.eqiad.wmnet completed [11:24:37] (03Merged) 10jenkins-bot: SuggestedInvestigations: Defer signal matching until transaction commits [extensions/CheckUser] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306650 (https://phabricator.wikimedia.org/T430617) (owner: 10Mszwarc) [11:24:39] (03Merged) 10jenkins-bot: SuggestedInvestigations: Defer signal matching until transaction commits [extensions/CheckUser] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306651 (https://phabricator.wikimedia.org/T430617) (owner: 10Mszwarc) [11:26:33] !log installing libpng1.6 security updates [11:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:23] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:27:46] (03Merged) 10jenkins-bot: SuggestedInvestigations: Link to Interaction Timeline for shared pages [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306658 (https://phabricator.wikimedia.org/T429785) (owner: 10Mszwarc) [11:28:17] (03PS1) 10Kosta Harlan: SimpleCaptcha: Log skipcaptcha right in force-show trigger [extensions/ConfirmEdit] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306667 (https://phabricator.wikimedia.org/T402595) [11:28:19] !log mszwarc@deploy1003 Started 
scap sync-world: Backport for [[gerrit:1306650|SuggestedInvestigations: Defer signal matching until transaction commits (T430617)]], [[gerrit:1306651|SuggestedInvestigations: Defer signal matching until transaction commits (T430617)]], [[gerrit:1306658|SuggestedInvestigations: Link to Interaction Timeline for shared pages (T429785)]] [11:28:26] T430617: Suggested investigations should ensure that the target user exists before trying to match signals for them - https://phabricator.wikimedia.org/T430617 [11:28:26] T429785: Link to InteractionTimeline from SI case - https://phabricator.wikimedia.org/T429785 [11:28:35] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12070069 (10MoritzMuehlenhoff) [11:30:22] (03CR) 10Marostegui: Allow a single replica for sre.mysql.major-upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1305682 (https://phabricator.wikimedia.org/T429758) (owner: 10CWilliams) [11:30:23] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1306650|SuggestedInvestigations: Defer signal matching until transaction commits (T430617)]], [[gerrit:1306651|SuggestedInvestigations: Defer signal matching until transaction commits (T430617)]], [[gerrit:1306658|SuggestedInvestigations: Link to Interaction Timeline for shared pages (T429785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:31:53] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [11:32:23] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:35:15] Msz2001: please let me know when you're finished [11:35:47] ack [11:36:07] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306650|SuggestedInvestigations: Defer signal matching until transaction commits (T430617)]], [[gerrit:1306651|SuggestedInvestigations: Defer signal matching until transaction commits (T430617)]], [[gerrit:1306658|SuggestedInvestigations: Link to Interaction Timeline for shared pages (T429785)]] (duration: 07m 48s) [11:36:13] T430617: Suggested investigations should ensure that the target user exists before trying to match signals for them - https://phabricator.wikimedia.org/T430617 [11:36:13] T429785: Link to InteractionTimeline from SI case - https://phabricator.wikimedia.org/T429785 [11:36:21] kostajh: done [11:37:23] FIRING: [14x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:37:44] jouncebot: nowandnext [11:37:44] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [11:37:44] In 0 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1200) [11:37:53] (03PS2) 10Klausman: role/ml_k8s/staging/worker: add IPIP role [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) [11:38:00] 
!log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [11:38:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306667 (https://phabricator.wikimedia.org/T402595) (owner: 10Kosta Harlan) [11:38:48] (03PS1) 10Muehlenhoff: Enable profile::docker::builder::prune_images on build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1306670 (https://phabricator.wikimedia.org/T417389) [11:39:04] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply [11:40:18] (03Merged) 10jenkins-bot: SimpleCaptcha: Log skipcaptcha right in force-show trigger [extensions/ConfirmEdit] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306667 (https://phabricator.wikimedia.org/T402595) (owner: 10Kosta Harlan) [11:40:46] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1306667|SimpleCaptcha: Log skipcaptcha right in force-show trigger (T402595)]] [11:40:55] T402595: Allow AbuseFilter CAPTCHA actions to apply to users with skipcaptcha right - https://phabricator.wikimedia.org/T402595 [11:42:47] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s2 [11:42:50] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s7 [11:42:50] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1306667|SimpleCaptcha: Log skipcaptcha right in force-show trigger (T402595)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:43:03] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2104.codfw.wmnet with OS trixie [11:44:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306670 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [11:44:07] !log kharlan@deploy1003 kharlan: Continuing with deployment [11:44:33] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2105.codfw.wmnet with OS trixie [11:45:55] (03CR) 10CWilliams: Allow a single replica for sre.mysql.major-upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1305682 (https://phabricator.wikimedia.org/T429758) (owner: 10CWilliams) [11:48:25] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306667|SimpleCaptcha: Log skipcaptcha right in force-show trigger (T402595)]] (duration: 07m 39s) [11:48:30] T402595: Allow AbuseFilter CAPTCHA actions to apply to users with skipcaptcha right - https://phabricator.wikimedia.org/T402595 [11:57:03] (03PS1) 10Filippo Giunchedi: dumps: temp allow production_networks for nfs healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1306672 (https://phabricator.wikimedia.org/T430651) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1200) [12:01:13] (03CR) 10Marostegui: Allow a single replica for sre.mysql.major-upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1305682 (https://phabricator.wikimedia.org/T429758) (owner: 10CWilliams) [12:02:50] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2104.codfw.wmnet with reason: host reimage [12:03:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12070229 (10fgiunchedi) >>! 
In T430304#12070030, @Milimetric wrote: > Approved (NOTE: I initially thought there was no NDA on file,... [12:04:04] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2105.codfw.wmnet with reason: host reimage [12:05:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12070243 (10fgiunchedi) [12:07:18] !log upgrade all bookworm hosts to pywmflib 3.0 - T430552 [12:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:24] T430552: Deploy wmflib 3.0.0 to production - https://phabricator.wikimedia.org/T430552 [12:07:57] 06SRE, 06Infrastructure-Foundations, 10netops: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12070253 (10cmooney) lsw1-d2-eqiad has been left in a "bad" state and I opened 05827811 against it with Nokia. [12:08:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1261: Migration of db1261.eqiad.wmnet completed [12:08:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:10:24] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2104.codfw.wmnet with reason: host reimage [12:14:24] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2105.codfw.wmnet with reason: host reimage [12:15:56] (03CR) 10Marostegui: [C:03+1] Allow a single replica for sre.mysql.major-upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1305682 (https://phabricator.wikimedia.org/T429758) (owner: 10CWilliams) [12:18:54] (03PS1) 10Marostegui: eqiad.yaml: Add clouddb1027 [puppet] - 10https://gerrit.wikimedia.org/r/1306677 (https://phabricator.wikimedia.org/T409557) [12:25:43] (03CR) 10Muehlenhoff: "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1306672 (https://phabricator.wikimedia.org/T430651) (owner: 10Filippo Giunchedi) [12:26:11] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180#12070303 (10elukey) 05Open→03Declined Declining, please re-open if needed! [12:26:58] !log klausman@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 365 days, 0:00:00 on 6 hosts with reason: Silence for decommissioning [12:27:37] (03CR) 10Filippo Giunchedi: [C:03+2] dumps: temp allow production_networks for nfs healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1306672 (https://phabricator.wikimedia.org/T430651) (owner: 10Filippo Giunchedi) [12:28:35] !log klausman@cumin2002 START - Cookbook sre.hosts.decommission for hosts ml-cache[2001-2003].codfw.wmnet,ml-cache[1001-1003].eqiad.wmnet [12:29:52] 10SRE-tools: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943#12070317 (10elukey) 05Open→03Resolved a:03elukey Resolving this, since most of the work is done and only a couple of tasks are left to do. [12:30:15] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2104.codfw.wmnet with OS trixie [12:31:04] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#12070337 (10elukey) 05Open→03Resolved a:03elukey Pretty sure this is not an issue anymore, resolving, please re-open if I am mistaken. 
[12:31:45] klausman@cumin2002 decommission (PID 3379585) is awaiting input [12:32:44] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:34:14] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2105.codfw.wmnet with OS trixie [12:34:29] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:34:29] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [12:34:49] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1262: Upgrading db1262.eqiad.wmnet [12:35:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1262: Upgrading db1262.eqiad.wmnet [12:36:01] (03PS2) 10Klausman: admin_ng/ml;ml-services: Remove ML Cassandra machines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306681 (https://phabricator.wikimedia.org/T430654) [12:36:48] (03PS3) 10Klausman: admin_ng/ml;ml-services: Remove ML Cassandra machines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306681 (https://phabricator.wikimedia.org/T430654) [12:37:17] (03PS4) 10Klausman: admin_ng/ml;ml-services: Remove ML Cassandra machines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306681 (https://phabricator.wikimedia.org/T430654) [12:38:53] (03CR) 10Jelto: "@jmeybohm@wikimedia.org I dumped the full changelog for v3.30 into the task description of T427400." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306307 (https://phabricator.wikimedia.org/T427400) (owner: 10Jelto) [12:41:18] 06SRE, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops, 06tools-infrastructure-team: Upgrade cloudsw1-e4-eqiad - https://phabricator.wikimedia.org/T429013#12070397 (10ayounsi) Unfortunately there is no easy way to perform the maintenance without impacting the devices connected to that switch. 
[12:41:48] cwilliams@cumin1003 major-upgrade (PID 126872) is awaiting input [12:42:07] FIRING: ProbeDown: Service ml-cache2001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#ml-cache2001-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12070412 (10cmooney) I spoke to @VRiley-WMF about this and we are going to schedule the change for Wed July 1st starting a... [12:47:07] FIRING: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:16] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [12:51:31] (03PS1) 10Klausman: hiera/manifests/install: Remove ml-cache machines [puppet] - 10https://gerrit.wikimedia.org/r/1306682 (https://phabricator.wikimedia.org/T430654) [12:52:07] FIRING: [12x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:52:22] FIRING: [12x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:12] PROBLEM - Host wikikube-worker1267 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3379.17 ms [12:54:38] RECOVERY - Host wikikube-worker1267 is UP: PING WARNING - Packet loss = 33%, RTA = 664.79 ms 
[12:54:53] (03CR) 10Tiziano Fogli: docker_registry: migrate nrpe checks to alertmanager (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1306351 (https://phabricator.wikimedia.org/T384321) (owner: 10Hnowlan) [12:55:52] (03CR) 10Ozge: [C:03+2] ml-services: editing-suggestions model bump to 20260630103440 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306656 (https://phabricator.wikimedia.org/T428882) (owner: 10Ozge) [12:56:20] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ml-cache[2001-2003].codfw.wmnet,ml-cache[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin2002" [12:56:49] (03CR) 10Ozge: [V:03+2 C:03+2] "merging this for now to share with the editing team today but please feel free to add your comments and we can still make more changes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306656 (https://phabricator.wikimedia.org/T428882) (owner: 10Ozge) [12:56:57] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ml-cache[2001-2003].codfw.wmnet,ml-cache[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin2002" [12:56:57] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:56:58] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ml-cache[2001-2003].codfw.wmnet,ml-cache[1001-1003].eqiad.wmnet [12:57:03] (03PS1) 10Giuseppe Lavagetto: * Decouple DSL templating from view layer, switch to jinja2 * Support matching on varnish vmod_var variables * Generate ratelimit stanzas for known-clients in HP [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1306683 [12:57:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) (owner: 10BPirkle) [12:57:07] FIRING: [12x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:22] FIRING: [12x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:58:09] (03Merged) 10jenkins-bot: ml-services: editing-suggestions model bump to 20260630103440 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306656 (https://phabricator.wikimedia.org/T428882) (owner: 10Ozge) [12:59:26] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] * Decouple DSL templating from view layer, switch to jinja2 * Support matching on varnish vmod_var variables * Generate ratelimit stanzas fo [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1306683 (owner: 10Giuseppe Lavagetto) [12:59:32] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:59:49] jouncebot: nowandnext [12:59:49] For the next 0 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1200) [12:59:49] In 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1300) [13:00:04] Lucas_WMDE, urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1300). 
nyaa~ [13:00:04] MichaelG_WMF, ihurbain, and bpirkle: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:00:10] o/ [13:00:15] o/ [13:00:16] nyaa~ [13:00:22] I'm here. I can go last if others want to go first, I'm in no hurry [13:00:32] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:00:43] \o [13:01:00] I’d suggest ihurbain, then MichaelG_WMF, then bpirkle [13:01:01] (03CR) 10Clément Goubert: [C:03+1] Enable profile::docker::builder::prune_images on build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1306670 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:01:12] works for me [13:01:20] the gate-and-submit of the backport should give us some time to see if anything starts to look fishy due to parsoid rollout [13:01:33] (and i can spiderpig myself, if that works) [13:01:40] hi hi [13:01:42] I'll do my private settings changes separately [13:01:44] ihurbain: go ahead IMHO [13:02:01] That order works for me 👍 [13:02:06] wheeeeeeeeeeeeee! 
[13:02:07] FIRING: [12x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306575 (https://phabricator.wikimedia.org/T430194) (owner: 10Isabelle Hurbain-Palatin) [13:03:10] ^^^ that cassandra failure is my bad, looking into it [13:03:18] (03Merged) 10jenkins-bot: Turn on Parsoid Read views for 25% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306575 (https://phabricator.wikimedia.org/T430194) (owner: 10Isabelle Hurbain-Palatin) [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:49] !log ihurbain@deploy1003 Started scap sync-world: Backport for [[gerrit:1306575|Turn on Parsoid Read views for 25% of English Wikipedia desktop traffic (T430194)]] [13:03:54] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194 [13:04:14] (03PS2) 10Giuseppe Lavagetto: hiddenparma: add default ratelimits file [puppet] - 10https://gerrit.wikimedia.org/r/1306501 (https://phabricator.wikimedia.org/T422249) [13:04:15] (03PS2) 10Giuseppe Lavagetto: cache::varnish: add rate-limit file generated from hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1306502 (https://phabricator.wikimedia.org/T422249) [13:04:15] (03PS2) 10Giuseppe Lavagetto: cache::varnish: switch known client rate limits to hp-generated data [puppet] - 10https://gerrit.wikimedia.org/r/1306503 
(https://phabricator.wikimedia.org/T422249) [13:04:46] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS trixie [13:05:57] !log ihurbain@deploy1003 ihurbain: Backport for [[gerrit:1306575|Turn on Parsoid Read views for 25% of English Wikipedia desktop traffic (T430194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:06:58] !log ihurbain@deploy1003 ihurbain: Continuing with deployment [13:07:26] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2094.codfw.wmnet with OS trixie [13:07:44] (03CR) 10Jelto: [C:03+1] "lgtm, but I don't know the DNS internals if 60s is supported. But 300s should also be fine if we announce maintenance beforehand in my opi" [dns] - 10https://gerrit.wikimedia.org/r/1306459 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:09:02] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2070.codfw.wmnet with OS trixie [13:10:20] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Allow to easily disable puppet-merges temporarily - https://phabricator.wikimedia.org/T423121#12070532 (10MoritzMuehlenhoff) 05Open→03Resolved A new cookbook has been added added and is documented at https://wikitech.wikimedia.org/wiki/Pup... [13:11:18] !log ihurbain@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306575|Turn on Parsoid Read views for 25% of English Wikipedia desktop traffic (T430194)]] (duration: 07m 29s) [13:11:23] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194 [13:11:31] done! [13:11:32] MichaelG_WMF: you need a deployer, right? [13:11:44] Lucas_WMDE: yes, I do. [13:12:14] ok, I can deploy [13:12:33] > Change(s) 1306342 touch l10n-related files and are likely to trigger a large l10n rebuild, resulting in a slow deployment (~20 minutes). 
[13:12:48] I think that’s a false positive from extension.json [13:12:53] continue anyway [13:12:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306341 (https://phabricator.wikimedia.org/T429110) (owner: 10Michael Große) [13:12:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Echo] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306342 (https://phabricator.wikimedia.org/T429110) (owner: 10Michael Große) [13:15:44] correct, neither change is intended to touch i18n [13:22:03] (03Merged) 10jenkins-bot: postEdit: temp account experiment instrumentation [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306341 (https://phabricator.wikimedia.org/T429110) (owner: 10Michael Große) [13:22:16] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2078.codfw.wmnet with reason: host reimage [13:22:44] (03Merged) 10jenkins-bot: maybeSendThankYouEdit: avoid sending notification to temp users [extensions/Echo] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306342 (https://phabricator.wikimedia.org/T429110) (owner: 10Michael Große) [13:23:13] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1306341|postEdit: temp account experiment instrumentation (T429110)]], [[gerrit:1306342|maybeSendThankYouEdit: avoid sending notification to temp users (T429110 T424205)]] [13:23:20] T429110: Temp Accounts A/B Test: Post-Publish Account Creation bottom sheet - https://phabricator.wikimedia.org/T429110 [13:23:21] T424205: Post-publish account creation nudge for temp accounts - https://phabricator.wikimedia.org/T424205 [13:24:03] (03PS1) 10Kosta Harlan: ConfirmEdit: Show a clear message when a login CAPTCHA is missing [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306688 
(https://phabricator.wikimedia.org/T428892) [13:24:17] (03PS1) 10Kosta Harlan: ConfirmEdit: Show a clear message when a login CAPTCHA is missing [extensions/ConfirmEdit] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306689 (https://phabricator.wikimedia.org/T428892) [13:25:18] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1306341|postEdit: temp account experiment instrumentation (T429110)]], [[gerrit:1306342|maybeSendThankYouEdit: avoid sending notification to temp users (T429110 T424205)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:25:45] MichaelG_WMF: please test :) [13:25:52] I'll have a look [13:26:00] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1262.eqiad.wmnet with OS trixie [13:27:04] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2094.codfw.wmnet with reason: host reimage [13:27:41] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2070.codfw.wmnet with reason: host reimage [13:28:55] (03PS2) 10Arnaudb: taskgen: allow profile_yaml to render templates [puppet] - 10https://gerrit.wikimedia.org/r/1306481 (https://phabricator.wikimedia.org/T425441) [13:29:30] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306327 (https://phabricator.wikimedia.org/T427401) (owner: 10JMeybohm) [13:29:47] (03CR) 10Ayounsi: depool-rack: run the k8s cookbook with relevant alias (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1306551 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [13:30:30] (03PS1) 10Cathal Mooney: LVS: add public vlan IPs/subnets for LVS still connected to L2 vlans [puppet] - 10https://gerrit.wikimedia.org/r/1306690 (https://phabricator.wikimedia.org/T430651) [13:30:31] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cirrussearch2078.codfw.wmnet with reason: host reimage [13:32:11] The echo one I can confirm working. The other one is tricky to verify because of aggressive temp account creation rate limits [13:33:36] (03CR) 10Ayounsi: rack depool: use build in reason fuction (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1306637 (owner: 10Ayounsi) [13:33:37] I'd say let's assume it is fine and move forward. If we notice a problem later, we can still investigate. At least I did not notice any disruptions [13:34:05] ok [13:34:07] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Continuing with deployment [13:34:08] thanks! [13:34:11] (03PS1) 10Jcrespo: Puppet 8: Replace legacy facts (backup version) [puppet] - 10https://gerrit.wikimedia.org/r/1306693 (https://phabricator.wikimedia.org/T372666) [13:35:05] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2070.codfw.wmnet with reason: host reimage [13:35:40] (03PS1) 10Fabfur: varnish: skip ratelimits on misc for translatewiki [puppet] - 10https://gerrit.wikimedia.org/r/1306694 (https://phabricator.wikimedia.org/T430613) [13:35:48] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306693 (https://phabricator.wikimedia.org/T372666) (owner: 10Jcrespo) [13:37:08] PROBLEM - Elasticsearch HTTPS for production-search-codfw on cirrussearch2094 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [13:37:42] (03PS2) 10Cathal Mooney: LVS: add public vlan IPs/subnets for LVS still connected to L2 vlans [puppet] - 10https://gerrit.wikimedia.org/r/1306690 (https://phabricator.wikimedia.org/T430651) [13:38:25] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306341|postEdit: temp account experiment instrumentation (T429110)]], [[gerrit:1306342|maybeSendThankYouEdit: avoid sending notification to temp users (T429110 
T424205)]] (duration: 15m 12s) [13:38:31] (03PS1) 10Muehlenhoff: Enable the weekly base build on build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1306695 (https://phabricator.wikimedia.org/T417389) [13:38:32] T429110: Temp Accounts A/B Test: Post-Publish Account Creation bottom sheet - https://phabricator.wikimedia.org/T429110 [13:38:32] T424205: Post-publish account creation nudge for temp accounts - https://phabricator.wikimedia.org/T424205 [13:38:43] bpirkle: over to you [13:38:57] Thank you! [13:39:15] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2094.codfw.wmnet with reason: host reimage [13:39:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12070658 (10Monrac5) >>! In T430304#12068953, @fgiunchedi wrote: > Thank you all! > > @Monrac5 we'd need to verify your ssh public... [13:39:54] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12070660 (10MoritzMuehlenhoff) [13:40:23] 10ops-eqiad, 06DC-Ops: Unresponsive management for backup1013.mgmt:22 - https://phabricator.wikimedia.org/T430661 (10phaultfinder) 03NEW [13:40:52] (03PS2) 10Fabfur: varnish: skip ratelimits on misc for translatewiki [puppet] - 10https://gerrit.wikimedia.org/r/1306694 (https://phabricator.wikimedia.org/T430613) [13:41:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bpirkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) (owner: 10BPirkle) [13:41:42] RECOVERY - MD RAID on wikikube-worker2159 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:42:16] (03Merged) 10jenkins-bot: REST: remove obsolete and unnecessary 
config entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304173 (https://phabricator.wikimedia.org/T422770) (owner: 10BPirkle) [13:42:35] @Lucas_WMDE Thank you for the deployment 🙏 [13:42:40] np :) [13:42:42] !log bpirkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1304173|REST: remove obsolete and unnecessary config entries (T422770 T423058 T422771)]] [13:42:51] T422770: REST: Audience Designations - clean up module enabling - https://phabricator.wikimedia.org/T422770 [13:42:51] T423058: REST: Audience Designations - clean up module enabling - enable site.v1 and specs.v0 in core by default - https://phabricator.wikimedia.org/T423058 [13:42:51] T422771: REST: Audience Designations - publish modules to REST Sandbox by default - https://phabricator.wikimedia.org/T422771 [13:42:57] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1262.eqiad.wmnet with reason: host reimage [13:44:57] !log bpirkle@deploy1003 bpirkle: Backport for [[gerrit:1304173|REST: remove obsolete and unnecessary config entries (T422770 T423058 T422771)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:48:45] !log bpirkle@deploy1003 bpirkle: Continuing with deployment [13:49:09] RECOVERY - Elasticsearch HTTPS for production-search-codfw on cirrussearch2094 is OK: SSL OK - Certificate cirrussearch2094.codfw.wmnet valid until 2026-07-28 13:43:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [13:50:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1262.eqiad.wmnet with reason: host reimage [13:51:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306695 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:51:59] (03PS2) 10CWilliams: Allow a single replica for sre.mysql.major-upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1305682 (https://phabricator.wikimedia.org/T429758) [13:52:15] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2078.codfw.wmnet with OS trixie [13:53:04] !log bpirkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1304173|REST: remove obsolete and unnecessary config entries (T422770 T423058 T422771)]] (duration: 10m 21s) [13:53:11] T422770: REST: Audience Designations - clean up module enabling - https://phabricator.wikimedia.org/T422770 [13:53:12] T423058: REST: Audience Designations - clean up module enabling - enable site.v1 and specs.v0 in core by default - https://phabricator.wikimedia.org/T423058 [13:53:12] T422771: REST: Audience Designations - publish modules to REST Sandbox by default - https://phabricator.wikimedia.org/T422771 [13:53:58] Doing mine [13:56:43] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2070.codfw.wmnet with OS trixie [13:57:24] Using scap now [13:58:45] (03PS2) 10Urbanecm: [Growth] frwiki: Deploy automated mentor list cleaner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305867 (https://phabricator.wikimedia.org/T427386) [13:58:46] (03PS1) 10Gkyziridis: 
ml-services: Deploy qwen36-27b in eager mode to fix startup timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306699 (https://phabricator.wikimedia.org/T425680) [13:58:50] (03CR) 10Urbanecm: [C:03+2] [Growth] frwiki: Deploy automated mentor list cleaner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305867 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [13:58:52] (03PS4) 10Fabfur: cache_misc: apply traffic classification [puppet] - 10https://gerrit.wikimedia.org/r/1276403 (owner: 10Giuseppe Lavagetto) [13:58:58] (03CR) 10Urbanecm: [Growth] frwiki: Deploy automated mentor list cleaner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305867 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [13:59:04] Dreamy_Jazz: sorry, i forgot to check here somehow [13:59:08] i'll wait [13:59:17] (03CR) 10CI reject: [V:04-1] cache_misc: apply traffic classification [puppet] - 10https://gerrit.wikimedia.org/r/1276403 (owner: 10Giuseppe Lavagetto) [14:00:02] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[14:00:04] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1400) [14:01:23] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2094.codfw.wmnet with OS trixie [14:03:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276403 (owner: 10Giuseppe Lavagetto) [14:03:26] (03PS1) 10Arnaudb: backup: drop trailing slashes from gerrit cache/logs excludes [puppet] - 10https://gerrit.wikimedia.org/r/1306698 (https://phabricator.wikimedia.org/T411583) [14:03:51] (03CR) 10Arnaudb: [C:03+2] backup: drop trailing slashes from gerrit cache/logs excludes [puppet] - 10https://gerrit.wikimedia.org/r/1306698 (https://phabricator.wikimedia.org/T411583) (owner: 10Arnaudb) [14:05:32] urbanecm: You can go now [14:05:41] As long as the window is free (but seems like) [14:05:50] jouncebot: nowandnext [14:05:50] For the next 0 hour(s) and 24 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1400) [14:05:50] In 0 hour(s) and 24 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1430) [14:06:09] (03CR) 10Urbanecm: [C:03+2] [Growth] frwiki: Deploy automated mentor list cleaner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305867 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:06:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1262.eqiad.wmnet with OS trixie [14:07:26] (03Merged) 10jenkins-bot: [Growth] frwiki: Deploy automated mentor list cleaner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305867 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:07:54] (03PS1) 10Jforrester: SkinComponentFooter: Show copyright for known, not just existing, pages [core] (wmf/1.47.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/1306703 (https://phabricator.wikimedia.org/T422655) [14:08:18] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1305867|[Growth] frwiki: Deploy automated mentor list cleaner (T427386)]] [14:08:23] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [14:10:25] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1305867|[Growth] frwiki: Deploy automated mentor list cleaner (T427386)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:49] !log urbanecm@deploy1003 urbanecm: Continuing with deployment [14:12:07] (03PS5) 10Fabfur: cache_misc: apply traffic classification [puppet] - 10https://gerrit.wikimedia.org/r/1276403 (owner: 10Giuseppe Lavagetto) [14:13:51] (03CR) 10Klausman: [C:03+2] role/ml_k8s/staging/worker: add IPIP role [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) (owner: 10Klausman) [14:15:08] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305867|[Growth] frwiki: Deploy automated mentor list cleaner (T427386)]] (duration: 06m 50s) [14:15:13] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [14:15:28] (03CR) 10Jcrespo: [C:03+1] "Only phab2002 failed, which probably is a CI artifact. 
I am happy to merge this, but want to make both @amir and @jhathaway@wikimedia.org " [puppet] - 10https://gerrit.wikimedia.org/r/1306693 (https://phabricator.wikimedia.org/T372666) (owner: 10Jcrespo) [14:18:15] (03PS2) 10Gkyziridis: ml-services: Deploy qwen36-27b in eager mode to fix startup timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306699 (https://phabricator.wikimedia.org/T425680) [14:18:17] !log UTC afternoon backport+config window done [14:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:46] (03CR) 10Revi: CommonSettings: add Ombuds to wgWMCGlobalGroupToRateLimitClass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306649 (https://phabricator.wikimedia.org/T430641) (owner: 10Revi) [14:20:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306649 (https://phabricator.wikimedia.org/T430641) (owner: 10Revi) [14:23:38] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: ml-staging-worker@codfw [14:26:15] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:26:40] (03PS1) 10Btullis: ceph-csi-rbd: add datahub-next to the tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306706 (https://phabricator.wikimedia.org/T402408) [14:26:43] (03PS1) 10Btullis: opensearch-operator: watch the datahub-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306707 (https://phabricator.wikimedia.org/T402408) [14:26:45] (03PS1) 10Btullis: datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) [14:27:23] !log klausman@cumin2002 END 
(PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:27:23] !log klausman@cumin2002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: ml-staging-worker@codfw [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1430) [14:32:40] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1262: Migration of db1262.eqiad.wmnet completed [14:35:05] (03PS1) 10Jforrester: abstractwiki: Stop mis-setting the relevant title on Abstract surfaces [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306709 (https://phabricator.wikimedia.org/T422655) [14:35:46] Anyone using the TK window? I have a handful of end-of-Year backports (joy). [14:36:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306652 (https://phabricator.wikimedia.org/T422710) (owner: 10Jforrester) [14:36:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306709 (https://phabricator.wikimedia.org/T422655) (owner: 10Jforrester) [14:36:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306703 (https://phabricator.wikimedia.org/T422655) (owner: 10Jforrester) [14:36:29] (03PS1) 10Klausman: hiera: Add IPIP config for workers in ml k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/1306704 (https://phabricator.wikimedia.org/T42043) [14:36:32] (03CR) 10Klausman: [C:03+2] hiera: Add IPIP config for workers in ml k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/1306704 
(https://phabricator.wikimedia.org/T42043) (owner: 10Klausman) [14:36:52] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2095.codfw.wmnet with OS trixie [14:37:08] (03PS1) 10WMDE-Fisch: Fix async loading in footnote click interaction experiment [extensions/Cite] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306710 (https://phabricator.wikimedia.org/T415904) [14:37:15] (03CR) 10Jforrester: [C:03+2] Add abstractwiki_fetch_section_token to POST requests [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306646 (owner: 10Jforrester) [14:37:30] (03PS1) 10WMDE-Fisch: Fix async loading in footnote click interaction experiment [extensions/Cite] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306711 (https://phabricator.wikimedia.org/T415904) [14:38:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Cite] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306710 (https://phabricator.wikimedia.org/T415904) (owner: 10WMDE-Fisch) [14:38:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Cite] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306711 (https://phabricator.wikimedia.org/T415904) (owner: 10WMDE-Fisch) [14:38:52] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2096.codfw.wmnet with OS trixie [14:40:42] !log atsuko@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2097.codfw.wmnet with OS trixie [14:41:07] (03Merged) 10jenkins-bot: abstractwiki: Show Abstract provenance notice to all readers, not just sysops [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306652 (https://phabricator.wikimedia.org/T422710) (owner: 10Jforrester) [14:41:09] 
(03Merged) 10jenkins-bot: abstractwiki: Stop mis-setting the relevant title on Abstract surfaces [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306709 (https://phabricator.wikimedia.org/T422655) (owner: 10Jforrester) [14:41:38] (03Merged) 10jenkins-bot: SkinComponentFooter: Show copyright for known, not just existing, pages [core] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306703 (https://phabricator.wikimedia.org/T422655) (owner: 10Jforrester) [14:42:08] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1306652|abstractwiki: Show Abstract provenance notice to all readers, not just sysops (T422710)]], [[gerrit:1306709|abstractwiki: Stop mis-setting the relevant title on Abstract surfaces (T422655)]], [[gerrit:1306703|SkinComponentFooter: Show copyright for known, not just existing, pages (T422655)]] [14:42:09] (03Merged) 10jenkins-bot: Add abstractwiki_fetch_section_token to POST requests [extensions/WikiLambda] (wmf/1.47.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1306646 (owner: 10Jforrester) [14:42:14] T422710: Add symmetric sysop activate/deactivate CTAs in the provenance banner, routed through CommunityConfiguration - https://phabricator.wikimedia.org/T422710 [14:42:15] T422655: Synthesise article-like skin chrome and point its tabs at abstract.wikipedia.org - https://phabricator.wikimedia.org/T422655 [14:42:21] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: ml-staging-worker@codfw [14:43:28] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1306652|abstractwiki: Show Abstract provenance notice to all readers, not just sysops (T422710)]], [[gerrit:1306709|abstractwiki: Stop mis-setting the relevant title on Abstract surfaces (T422655)]], [[gerrit:1306703|SkinComponentFooter: Show copyright for known, not just existing, pages (T422655)]] [14:45:29] !log jforrester@deploy1003 jforrester: Backport for 
[[gerrit:1306652|abstractwiki: Show Abstract provenance notice to all readers, not just sysops (T422710)]], [[gerrit:1306709|abstractwiki: Stop mis-setting the relevant title on Abstract surfaces (T422655)]], [[gerrit:1306703|SkinComponentFooter: Show copyright for known, not just existing, pages (T422655)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:46:00] !log jforrester@deploy1003 jforrester: Continuing with deployment [14:46:06] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:47:13] !log klausman@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:47:13] !log klausman@cumin2002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: ml-staging-worker@codfw [14:48:45] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:49:05] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [14:50:23] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306652|abstractwiki: Show Abstract provenance notice to all readers, not just sysops (T422710)]], [[gerrit:1306709|abstractwiki: Stop mis-setting the relevant title on Abstract surfaces (T422655)]], [[gerrit:1306703|SkinComponentFooter: Show copyright for known, not just existing, pages (T422655)]] (duration: 06m 55s) [14:50:30] T422710: Add symmetric sysop activate/deactivate CTAs in the provenance banner, routed through CommunityConfiguration - https://phabricator.wikimedia.org/T422710 [14:50:30] T422655: Synthesise article-like skin chrome and point its tabs at abstract.wikipedia.org - https://phabricator.wikimedia.org/T422655
[14:53:41] (03PS1) 10Marostegui: installserveR: Do not format /srv on clouddb102[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/1306713 (https://phabricator.wikimedia.org/T409557) [14:54:21] !log roll restart eventgates to re-cache page_change related schemas - T423583#12067516 [14:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:29] T423583: mediawiki.page_change.v1 event - Add revision revert details - https://phabricator.wikimedia.org/T423583 [14:54:56] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [14:55:00] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [14:55:11] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync [14:55:51] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync [14:55:52] (03Abandoned) 10Marostegui: installserveR: Do not format /srv on clouddb102[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/1306713 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [14:56:25] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync [14:57:00] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync [14:57:05] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2095.codfw.wmnet with reason: host reimage [14:57:21] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync [14:57:29] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync [14:57:42] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync [14:58:08] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync [14:58:17] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync 
[14:58:38] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [14:58:46] (03CR) 10Ladsgroup: [C:03+1] Puppet 8: Replace legacy facts (backup version) [puppet] - 10https://gerrit.wikimedia.org/r/1306693 (https://phabricator.wikimedia.org/T372666) (owner: 10Jcrespo) [14:59:07] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2096.codfw.wmnet with reason: host reimage [14:59:11] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync [14:59:21] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [14:59:30] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy qwen36-27b in eager mode to fix startup timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306699 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [14:59:42] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [14:59:51] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 30818 [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: gettimeofday() says it's time for SRE Collaboration Services office hours. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1500) [15:00:08] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [15:00:13] !log atsuko@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2097.codfw.wmnet with reason: host reimage [15:00:47] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [15:00:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 30818 [15:00:54] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [15:02:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:03:21] (03PS1) 10Marostegui: mariadb: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306715 (https://phabricator.wikimedia.org/T372666) [15:03:37] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306715 (https://phabricator.wikimedia.org/T372666) (owner: 10Marostegui) [15:05:49] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2095.codfw.wmnet with reason: host reimage [15:07:34] (03PS1) 10Slyngshede: data.yaml: extend sarmbruster for another month [puppet] - 10https://gerrit.wikimedia.org/r/1306716 [15:08:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1306716 (owner: 10Slyngshede) [15:09:25] FIRING: SystemdUnitFailed: requestctl-credential-refresh.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:37] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2096.codfw.wmnet with reason: host reimage [15:10:42] (03CR) 10Dzahn: "well, the diff between the 2 proxy configs is more than just the protocol now. and yea, I wanted to change this pattern first independentl" [puppet] - 10https://gerrit.wikimedia.org/r/1306445 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [15:12:40] (03CR) 10Slyngshede: [C:03+2] data.yaml: extend sarmbruster for another month [puppet] - 10https://gerrit.wikimedia.org/r/1306716 (owner: 10Slyngshede) [15:12:55] (03CR) 10Dzahn: "oh yea, this is also because in the next change I wanted to be able to clearly/cleanly "stop including old config"" [puppet] - 10https://gerrit.wikimedia.org/r/1306445 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [15:13:21] (03CR) 10Dzahn: [C:03+2] contint: in httpd include proxy configs individually, not by wildcard [puppet] - 10https://gerrit.wikimedia.org/r/1306445 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [15:14:13] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2097.codfw.wmnet with reason: host reimage [15:15:25] (03CR) 10Marostegui: "PCC looks good, no differences: https://puppet-compiler.wmflabs.org/output/1306715/7102/" [puppet] - 10https://gerrit.wikimedia.org/r/1306715 (https://phabricator.wikimedia.org/T372666) (owner: 10Marostegui) [15:17:25] !log urbanecm@deploy1003 mwscript-k8s job started: GrowthExperiments:cleanMentorList.php --wiki=frwiki # T427386 [15:17:29] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [15:18:11] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1262: Migration of db1262.eqiad.wmnet completed [15:18:12] !log cwilliams@cumin1003 END (PASS) - Cookbook 
sre.mysql.major-upgrade (exit_code=0) [15:18:46] (03PS1) 10Urbanecm: Revert "[Growth] frwiki: Deploy automated mentor list cleaner" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306718 (https://phabricator.wikimedia.org/T427386) [15:18:53] jouncebot: nowandnext [15:18:53] For the next 0 hour(s) and 41 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1500) [15:18:54] In 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1600) [15:19:03] (03CR) 10Urbanecm: [C:03+2] Revert "[Growth] frwiki: Deploy automated mentor list cleaner" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306718 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [15:20:21] (03Merged) 10jenkins-bot: Revert "[Growth] frwiki: Deploy automated mentor list cleaner" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306718 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [15:20:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1306715 (https://phabricator.wikimedia.org/T372666) (owner: 10Marostegui) [15:23:10] (03CR) 10Ladsgroup: [C:03+1] mariadb: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306715 (https://phabricator.wikimedia.org/T372666) (owner: 10Marostegui) [15:23:30] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1306718|Revert "[Growth] frwiki: Deploy automated mentor list cleaner" (T427386)]] [15:23:30] (03CR) 10Marostegui: [C:03+2] mariadb: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306715 (https://phabricator.wikimedia.org/T372666) (owner: 10Marostegui) [15:23:34] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [15:24:47] (03CR) 10Dzahn: "I would do 300 or ask for +1 from traffic first. 
I haven't seen anyone going under 300 before and remember discussion about how low the mi" [dns] - 10https://gerrit.wikimedia.org/r/1306459 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [15:25:40] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1306718|Revert "[Growth] frwiki: Deploy automated mentor list cleaner" (T427386)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:26:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. The reason phab2002 is failing is because it no longer exists:" [puppet] - 10https://gerrit.wikimedia.org/r/1306693 (https://phabricator.wikimedia.org/T372666) (owner: 10Jcrespo) [15:27:04] urbanecm: I'd like to backport something with i18n changes when you're done [15:27:09] !log urbanecm@deploy1003 urbanecm: Continuing with deployment [15:27:15] kostajh: sure, i'll ping you then [15:28:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276403 (owner: 10Giuseppe Lavagetto) [15:28:43] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2095.codfw.wmnet with OS trixie [15:29:23] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2096.codfw.wmnet with OS trixie [15:31:31] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306718|Revert "[Growth] frwiki: Deploy automated mentor list cleaner" (T427386)]] (duration: 08m 01s) [15:31:36] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [15:31:42] kostajh: i'm done [15:32:04] thanks [15:32:38] (03PS2) 10Hnowlan: docker_registry: migrate nrpe checks to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1306351 (https://phabricator.wikimedia.org/T384321) [15:32:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/1306689 (https://phabricator.wikimedia.org/T428892) (owner: 10Kosta Harlan) [15:32:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306688 (https://phabricator.wikimedia.org/T428892) (owner: 10Kosta Harlan) [15:33:06] (03CR) 10Hnowlan: "Thank you for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1306351 (https://phabricator.wikimedia.org/T384321) (owner: 10Hnowlan) [15:35:43] (03PS1) 10Slyngshede: P:cache::haproxy bump Docker image version [puppet] - 10https://gerrit.wikimedia.org/r/1306723 [15:36:25] !log atsuko@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2097.codfw.wmnet with OS trixie [15:37:38] FIRING: CertAlmostExpired: gNMI TLS certificate for lsw1-d2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:37:41] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:37:42] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [15:37:52] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [15:38:01] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1263: Upgrading db1263.eqiad.wmnet [15:38:02] (03CR) 10Jcrespo: [C:03+2] Puppet 8: Replace legacy facts (backup version) [puppet] - 10https://gerrit.wikimedia.org/r/1306693 (https://phabricator.wikimedia.org/T372666) (owner: 10Jcrespo) [15:38:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1263: Upgrading db1263.eqiad.wmnet [15:42:11] (03Merged) 10jenkins-bot: ConfirmEdit: Show a clear message when a login CAPTCHA is missing [extensions/ConfirmEdit] (wmf/1.47.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/1306689 (https://phabricator.wikimedia.org/T428892) (owner: 10Kosta Harlan) [15:42:14] (03Merged) 10jenkins-bot: ConfirmEdit: Show a clear message when a login CAPTCHA is missing [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306688 (https://phabricator.wikimedia.org/T428892) (owner: 10Kosta Harlan) [15:42:42] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1306689|ConfirmEdit: Show a clear message when a login CAPTCHA is missing (T428892)]], [[gerrit:1306688|ConfirmEdit: Show a clear message when a login CAPTCHA is missing (T428892)]] [15:42:48] T428892: Cannot login: incorrectly claims wrong username and password - https://phabricator.wikimedia.org/T428892 [15:42:50] (03PS1) 10Dzahn: ci: replace legacy facts, servers in production [puppet] - 10https://gerrit.wikimedia.org/r/1306725 (https://phabricator.wikimedia.org/T372666) [15:43:26] cwilliams@cumin1003 major-upgrade (PID 148543) is awaiting input [15:44:15] (03PS2) 10JHathaway: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1305988 (https://phabricator.wikimedia.org/T372666) [15:44:30] (03CR) 10Kamila Součková: [C:03+1] admin_ng: Fix comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305542 (owner: 10RLazarus) [15:44:53] (03CR) 10RLazarus: [C:03+2] admin_ng: Fix comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305542 (owner: 10RLazarus) [15:44:56] (03CR) 10CI reject: [V:04-1] Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1305988 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [15:48:04] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1276403 (owner: 10Giuseppe Lavagetto) [15:49:30] (03CR) 10Dzahn: [C:04-2] "https://puppet-compiler.wmflabs.org/output/1306725/8814/contint2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1306725 (https://phabricator.wikimedia.org/T372666) 
(owner: 10Dzahn) [15:50:15] (03PS2) 10Dzahn: ci: replace legacy facts, servers in production [puppet] - 10https://gerrit.wikimedia.org/r/1306725 (https://phabricator.wikimedia.org/T372666) [15:52:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:30] (03Merged) 10jenkins-bot: admin_ng: Fix comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305542 (owner: 10RLazarus) [15:56:40] (03CR) 10Btullis: [C:03+2] ceph-csi-rbd: add datahub-next to the tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306706 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [15:56:51] (03CR) 10Btullis: [C:03+2] opensearch-operator: watch the datahub-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306707 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:00:05] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1600). [16:00:05] sfaci: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:24] hello! looking [16:00:24] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1306725/8815/" [puppet] - 10https://gerrit.wikimedia.org/r/1306725 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [16:00:27] nearly done with my backport [16:00:42] rzl: will let you know when it finishes syncing. it has i18n changes, so is taking longer [16:00:51] I'm here! 
[16:01:06] kostajh: this is just a puppet patch on a maintenance script, fine to do them at the same time :) [16:02:07] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1306689|ConfirmEdit: Show a clear message when a login CAPTCHA is missing (T428892)]], [[gerrit:1306688|ConfirmEdit: Show a clear message when a login CAPTCHA is missing (T428892)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:02:12] T428892: Cannot login: incorrectly claims wrong username and password - https://phabricator.wikimedia.org/T428892 [16:02:17] (03PS3) 10JHathaway: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1305985 (https://phabricator.wikimedia.org/T372666) [16:02:51] !log kharlan@deploy1003 kharlan: Continuing with deployment [16:03:00] sure, just wanted to give you a heads up [16:03:17] sfaci: no action needed, but, for reference: when you do the "check experimental" puppet compiler thing in CI, you should include a "Hosts:" trailer on your commit message to tell it what hosts to compile -- in this case, deploy1003.eqiad.wmnet [16:03:46] if you don't do that, it says "every host I can think of? you got it, boss" and then it works as hard as it can for three hours and then passes out [16:04:04] Oops! I'm sorry. I didn't realize that [16:04:08] I think there is a task open to change that default behavior to something else, which is probably a good idea [16:04:26] no worries! just wanted to explain the CI failure is nothing to worry about [16:04:54] Thanks! 
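For reference, the "Hosts:" trailer mentioned at 16:03:17 is a footer line in the commit message. Below is a minimal Python sketch of how such a trailer could be read. It assumes the value is a comma-separated list of host names; the function name parse_hosts_trailer and the example subject line are hypothetical, and this is not the catalog compiler's actual implementation.

```python
import re


def parse_hosts_trailer(commit_message: str) -> list[str]:
    """Return the hosts named in any 'Hosts:' trailer line (hypothetical helper)."""
    hosts = []
    for line in commit_message.splitlines():
        match = re.match(r"^Hosts:\s*(.+)$", line.strip())
        if match:
            # Assumption: several hosts may be given, separated by commas.
            hosts.extend(h.strip() for h in match.group(1).split(",") if h.strip())
    return hosts


# Illustrative commit message; only the host name comes from the log above.
example = """Example subject line for a puppet change

Hosts: deploy1003.eqiad.wmnet
"""
print(parse_hosts_trailer(example))  # ['deploy1003.eqiad.wmnet']
```

With no trailer present the sketch returns an empty list, which corresponds to the "every host I can think of" case described at 16:03:46.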
[16:05:20] (03Merged) 10jenkins-bot: ceph-csi-rbd: add datahub-next to the tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306706 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:05:28] (03Merged) 10jenkins-bot: opensearch-operator: watch the datahub-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306707 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:09:25] RESOLVED: SystemdUnitFailed: requestctl-credential-refresh.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:37] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8816/co" [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [16:09:42] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/0/0 (Transport: Arelion (IC-398711) {#changeme_codfw_arelion_cct}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:03] (03CR) 10Atsuko: [C:03+1] datahub-next: deploy a dedicated OpenSearch 2.x cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:10:07] (03PS4) 
10CDanis: cache::haproxy: Enable jwt for upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: 10BCornwall) [16:10:11] (03CR) 10CDanis: "There's a specific OSM community use we'd like to support. They have their own thumbnail caching infra and offered to attach jwts, is my " [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: 10BCornwall) [16:10:18] (03PS2) 10Btullis: datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) [16:11:09] sfaci: oops, I'm glad I ran that actually. I missed that you're renaming the resource that was originally there -- that's fine, but you should add a resource with the original name and "ensure => absent" so that it gets deleted, otherwise Puppet will just stop managing it and you'll have all three [16:11:33] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1263.eqiad.wmnet with OS trixie [16:11:38] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: 10BCornwall) [16:11:38] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: 10BCornwall) [16:11:54] sfaci: want to make that change now, or would you like me to edit it for you? [16:12:24] (03CR) 10CI reject: [V:04-1] datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:12:53] Do you mean the old maintenance job because its name is not used anymore? 
[16:12:56] yeah, exactly [16:13:15] as far as Puppet is concerned, this is two new jobs and one abandoned one [16:13:54] if you can do that change, it would be great [16:14:00] will do [16:14:06] Thank you very much! [16:14:36] (03PS38) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:14:38] I didn't realize but I think I have seen before what you mean. We should have been explicit instead of just removing the old one [16:14:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:45] (03CR) 10CDobbins: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [16:15:10] (03PS1) 10BCornwall: Revert "geo-maps: switch CN to to eqsin (from ulsfo)" [dns] - 10https://gerrit.wikimedia.org/r/1306727 [16:15:17] (03CR) 10CI reject: [V:04-1] varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [16:16:25] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306689|ConfirmEdit: Show a clear message when a login CAPTCHA is missing (T428892)]], [[gerrit:1306688|ConfirmEdit: Show a clear message when a login CAPTCHA is missing (T428892)]] (duration: 33m 43s) [16:16:30] T428892: Cannot login: incorrectly claims wrong username and password - https://phabricator.wikimedia.org/T428892 [16:16:52] (03CR) 10BCornwall: "Whether this is something we want to do (there's no mention of it in the robot policy page for upload) is potentially up for debate: Is th" [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: 
10BCornwall) [16:17:42] (03CR) 10BCornwall: [C:03+1] P:cache::haproxy bump Docker image version [puppet] - 10https://gerrit.wikimedia.org/r/1306723 (owner: 10Slyngshede) [16:19:15] (03PS6) 10RLazarus: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [16:19:51] (03CR) 10Atsuko: [C:03+1] datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:21:23] (rerunning pcc, almost there) [16:21:38] 👍 [16:21:49] (03CR) 10Fabfur: [C:03+2] cache_misc: apply traffic classification [puppet] - 10https://gerrit.wikimedia.org/r/1276403 (owner: 10Giuseppe Lavagetto) [16:22:52] orrrr I'll learn something about how this class is set up. one more try, thanks for your patience [16:24:05] 06SRE, 06Editing-team, 06Fundraising-Backlog, 06Traffic-Icebox, and 5 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085#12072011 (102313Prisonpecia) sa [16:24:35] (03CR) 10CDanis: "Yeah, fair. In the general case I think it makes some sense, and we should at least discuss it. 
This has potential applications not just" [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: 10BCornwall) [16:24:52] (03PS3) 10Btullis: datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) [16:25:54] (03PS1) 10Fabfur: Revert "cache_misc: apply traffic classification" [puppet] - 10https://gerrit.wikimedia.org/r/1306730 [16:27:14] (03PS7) 10RLazarus: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [16:27:19] (03CR) 10Btullis: datahub-next: deploy a dedicated OpenSearch 2.x cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:27:31] (03CR) 10Btullis: [C:03+2] datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:28:02] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1263.eqiad.wmnet with reason: host reimage [16:28:04] (Done with my deploy) [16:28:40] 06SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 10Event-Platform: Flink Page View: Create K8s resources - https://phabricator.wikimedia.org/T426425#12072096 (10Ahoelzl) 05Open→03Resolved [16:28:43] 👍 [16:29:38] (03Merged) 10jenkins-bot: datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:29:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:30:06] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8818/co" [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [16:30:43] (03PS1) 10Fabfur: varnish: fix typo in misc template [puppet] - 10https://gerrit.wikimedia.org/r/1306731 [16:31:33] (03CR) 10Fabfur: [C:03+2] varnish: fix typo in misc template [puppet] - 10https://gerrit.wikimedia.org/r/1306731 (owner: 10Fabfur) [16:31:47] there we go. I copied the testwiki command instead of the aawiki one, so there's an extra line in the pcc diff, but that doesn't actually matter, what matters is the ensure present->absent line [16:32:23] sfaci: going ahead if you're still around to monitor :) [16:32:33] !log bking@cumin2003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2073.codfw.wmnet with OS trixie [16:33:19] (03CR) 10RLazarus: [V:03+1 C:03+2] Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [16:34:01] I'm still here! [16:34:05] rad :) [16:34:32] puppet-merging now, then running puppet on the deploy host, which'll take a few minutes, I'll let you know when I'm running helmfile apply which will actually stop the old job and start the new ones [16:34:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:34:47] !log bking@cumin2003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2089.codfw.wmnet'] [16:34:50] ok! 
[16:35:04] ^ WidespreadPuppetFailure is unrelated, that's another issue that f.abfur is working on [16:35:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1263.eqiad.wmnet with reason: host reimage [16:35:26] you also copied the new script_label for the job that will be absent, is that ok? [16:35:37] yeah, sloppy but fine :) [16:35:53] ok ok, the relevant part is the `ensure => absent` one, got it [16:35:57] when it's absent none of those fields will actually be used for anything, some of them are just syntactically required anyway [16:36:15] instead of saying "don't create a job with this script_label," I said "don't create a job with that script_label" [16:36:26] but, because it won't create a job at all, [16:36:39] cool! [16:36:53] PROBLEM - Host cirrussearch2089 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:53] (03PS1) 10Fabfur: Revert "varnish: fix typo in misc template" [puppet] - 10https://gerrit.wikimedia.org/r/1306733 [16:37:33] (03CR) 10Fabfur: [C:03+2] Revert "varnish: fix typo in misc template" [puppet] - 10https://gerrit.wikimedia.org/r/1306733 (owner: 10Fabfur) [16:37:37] after we're happy here, the last thing is just to send a followup patch that removes the ensure => absent job -- that'll be a no-op but it's a good cleanup step [16:37:48] (03PS2) 10Fabfur: Revert "cache_misc: apply traffic classification" [puppet] - 10https://gerrit.wikimedia.org/r/1306730 [16:37:50] I was typing a question about that right now [16:38:03] ok ok [16:38:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:38:32] the absent thing is only for the deployment to stop the job and then, we can clean up the code to leave only the two maintenance jobs we want to be running since now [16:38:41] is that right? [16:38:43] yeah exactly [16:39:01] Perfect! 
Thanks Reuven for this master class about maintenance jobs [16:39:03] puppet is declarative, right, you tell it "there should be such-and-such an object" and if the object doesn't already exist, it makes it so [16:39:15] anything you *don't* mention in puppet, puppet won't care about one way or the other -- won't create it, won't delete it [16:39:20] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:39:41] Ah, like running commands [16:39:44] so if it exists and you want puppet to get rid of it, the declarative statement is "there shouldn't be any object by this name" -- but after it's deleted, you don't need that statement anymore [16:39:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:39:50] I wasn't seeing that code like that [16:40:06] (03CR) 10Fabfur: [C:03+2] Revert "cache_misc: apply traffic classification" [puppet] - 10https://gerrit.wikimedia.org/r/1306730 (owner: 10Fabfur) [16:40:16] first time dealing with maintenance jobs [16:40:17] !log bking@cumin2003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cirrussearch2089.codfw.wmnet'] [16:40:43] RECOVERY - Host cirrussearch2089 is UP: PING OK - Packet loss = 0%, RTA = 32.99 ms [16:43:52] sfaci: okay, deploying now [16:43:58] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:44:08] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [16:44:11] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [16:44:27] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:44:39] 👍 [16:44:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in 
drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:44:58] done! let me know if everything looks good after the first run [16:45:35] Puppet failures are known and being serviced - bad varnish updated [16:46:42] (03PS2) 10Gergő Tisza: SecurityLogs: Create by moving code from mediawiki-config [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306735 (https://phabricator.wikimedia.org/T430564) [16:48:24] sfaci: looks like they're completing successfully at least, are you happy with what you're seeing? [16:48:49] I was still trying to take a look [16:49:04] Now I'm seeing both jobs running fine [16:49:12] oh sorry, no rush :) [16:49:17] It seems they are running fine! [16:49:29] Sorry! Looking for the right logstash dashboard [16:49:37] All good Reuven [16:49:43] awesome [16:49:44] I see them [16:49:47] (03PS1) 10Gergő Tisza: SecurityLogs: Add tests [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306737 (https://phabricator.wikimedia.org/T430564) [16:49:48] (03CR) 10Atsuko: [C:03+1] datahub-next: deploy a dedicated OpenSearch 2.x cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306708 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [16:49:49] Thank you very much!! [16:49:56] want to send that deletion patch for practice, or would you like me to send it to you? [16:50:01] Good learnings as well! [16:50:07] I'll do it! 
[16:50:12] you have done too much already [16:50:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306735 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [16:50:19] and, as you mentioned, I need to practice [16:50:24] no need to schedule that one in a puppet window or anything, we can just merge it since there'll be nothing to monitor [16:50:31] just add me as a reviewer :) [16:50:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306737 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [16:50:37] I'll do it [16:50:42] sounds good! [16:50:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [16:50:44] Understood! [16:50:48] Thank you very much! [16:51:16] (03CR) 10BCornwall: [C:04-1] "Actually, no, this version was pinned because we require bullseye-backports, which was archived in versions later than this. So this will " [puppet] - 10https://gerrit.wikimedia.org/r/1306723 (owner: 10Slyngshede) [16:51:27] thanks for flying Puppet Window, please ensure you have all your belongings as you exit [16:51:38] jajajaja [16:51:45] Nice flight! 
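The declarative behaviour explained between 16:11:09 and 16:39:44 (rename a resource and the old one lingers, unless it is declared with "ensure => absent") can be modelled in a few lines. This is a toy Python sketch, not Puppet code and not how Puppet is implemented; the reconcile function and the job names are hypothetical.

```python
def reconcile(actual: set[str], declared: dict[str, str]) -> set[str]:
    """Apply declared resources ('present' or 'absent') to the actual state.

    Names that are not mentioned in `declared` are left untouched: they are
    neither created nor deleted.
    """
    result = set(actual)
    for name, ensure in declared.items():
        if ensure == "present":
            result.add(name)
        elif ensure == "absent":
            result.discard(name)
        else:
            raise ValueError(f"unknown ensure value: {ensure!r}")
    return result


# Hypothetical job names, for illustration only.
running = {"old-job"}
new_jobs = {"new-job-a": "present", "new-job-b": "present"}

# Rename without cleanup: the old job is simply no longer mentioned.
renamed_only = reconcile(running, new_jobs)
print(sorted(renamed_only))  # ['new-job-a', 'new-job-b', 'old-job']

# Rename plus an explicit absent entry for the original name.
with_cleanup = reconcile(running, {**new_jobs, "old-job": "absent"})
print(sorted(with_cleanup))  # ['new-job-a', 'new-job-b']

# Follow-up patch: dropping the absent entry afterwards is a no-op.
after_followup = reconcile(with_cleanup, new_jobs)
print(after_followup == with_cleanup)  # True
```

The first call shows the "all three" outcome, the second the cleanup, and the third why the later deletion patch needs no monitoring.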
[16:52:51] (03PS39) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:52:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1263.eqiad.wmnet with OS trixie [16:54:48] (03CR) 10CI reject: [V:04-1] SecurityLogs: Add tests [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306737 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [16:54:58] (03PS1) 10ArielGlenn: ExtensionDistributor: mark 1.46 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306739 (https://phabricator.wikimedia.org/T423272) [16:55:42] (03PS1) 10Dzahn: aphlict: replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306740 (https://phabricator.wikimedia.org/T372666) [16:56:14] (03CR) 10Jforrester: [C:04-1] ExtensionDistributor: mark 1.46 as stable (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306739 (https://phabricator.wikimedia.org/T423272) (owner: 10ArielGlenn) [16:59:19] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1306740/8823/" [puppet] - 10https://gerrit.wikimedia.org/r/1306740 (https://phabricator.wikimedia.org/T372666) (owner: 10Dzahn) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1700) [17:04:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:08:33] (03PS2) 10ArielGlenn: ExtensionDistributor: mark 1.46 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306739 (https://phabricator.wikimedia.org/T423272) [17:08:49] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - kubemaster_6443: Servers wikikube-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:09:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:09:49] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:13:27] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304#12072406 (10KFrancis) Hi all, the NDA is complete. Thanks! [17:14:45] RESOLVED: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:18:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1263: Migration of db1263.eqiad.wmnet completed [17:21:01] PROBLEM - SSH on urldownloader1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:21:42] FIRING: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:09] (03PS3) 10Ayounsi: depool-rack: run the k8s cookbook with relevant alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1306551 (https://phabricator.wikimedia.org/T327300) [17:30:09] (03PS2) 10Ayounsi: rack depool: use build in reason fuction [cookbooks] - 10https://gerrit.wikimedia.org/r/1306637 
[17:34:30] (03CR) 10FNegri: [C:03+1] eqiad.yaml: Add clouddb1027 [puppet] - 10https://gerrit.wikimedia.org/r/1306677 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [17:35:48] hCaptcha is in failover, did someone just deploy a change to the urldownloaders? [17:36:01] uh? [17:36:06] since when? [17:36:12] Last 10 minutes [17:36:24] Plus "Service urldownloader1005:8080 has failed probes" is firing above [17:36:33] 1005 should not be in production I think? [17:36:35] checking [17:36:59] +url-downloader.eqiad 5M IN CNAME urldownloader1005 [17:37:04] so we did change it, but in the morning [17:37:20] the host is down [17:37:21] Yeah, 1,400 calls to siteverify failed [17:37:22] sigh [17:37:39] reverting I guess [17:37:56] > All traffic now goes via the new nodes. Tomorrow I'll switch the old proxies into the insetup role and stop Squid (and keep them around for a grace period) [17:37:58] Thanks [17:38:14] codfw one is fine [17:38:27] (03PS1) 10Ssingh: Revert "Failover url-downloader.eqiad CNAME to one of the new Trixie hosts" [dns] - 10https://gerrit.wikimedia.org/r/1306741 [17:38:34] (03CR) 10CI reject: [V:04-1] Revert "Failover url-downloader.eqiad CNAME to one of the new Trixie hosts" [dns] - 10https://gerrit.wikimedia.org/r/1306741 (owner: 10Ssingh) [17:39:58] (03PS1) 10Ssingh: wikimedia.org: revert urldownloader in eqiad to 1003 [dns] - 10https://gerrit.wikimedia.org/r/1306742 (https://phabricator.wikimedia.org/T427282) [17:40:24] Dreamy_Jazz: we will move urldownloader behind an LB shortly and will introduce more redundancy that way [17:40:41] "shortly" -> sometime around early next week if all goes well [17:40:49] Thanks [17:40:57] I'll keep an eye on failover for hCaptcha [17:41:46] (03CR) 10Ssingh: [C:03+2] wikimedia.org: revert urldownloader in eqiad to 1003 [dns] - 10https://gerrit.wikimedia.org/r/1306742 (https://phabricator.wikimedia.org/T427282) (owner: 10Ssingh) [17:41:58] !log sukhe@dns1004 START - running authdns-update [17:43:58] !log
sukhe@dns1004 END - running authdns-update [17:44:45] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache url-downloader.eqiad.wikimedia.org on all recursors [17:44:49] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) url-downloader.eqiad.wikimedia.org on all recursors [17:45:06] Dreamy_Jazz: we should be back now [17:45:35] (03CR) 10Jforrester: [C:03+1] ExtensionDistributor: mark 1.46 as stable (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306739 (https://phabricator.wikimedia.org/T423272) (owner: 10ArielGlenn) [17:46:24] Yep, failover has stopped [17:46:25] Thanks! [17:46:35] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#12072623 (10ssingh) I haven't caught up with the ticket yet as I was out but note that `urldownloader1005` failed again (just eqiad), so the above change reverts `url-downloader... [17:46:39] sorry about this [17:46:46] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps backups: exclude xtools worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1305970 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [17:47:48] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2110.codfw.wmnet with OS trixie [17:47:55] (03CR) 10Andrew Bogott: [C:03+2] Superset, Quarry: Open security groups from cumin to magnum workers [puppet] - 10https://gerrit.wikimedia.org/r/1304157 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [17:48:17] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3081.* [17:48:42] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 06Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12072629 (10ssingh) After some discussion in Traffic and confirming with @BBlack, we are going with Option 3, putting this behind the... 
[17:48:59] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2079.codfw.wmnet with OS trixie [17:52:17] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2065.codfw.wmnet with OS trixie [17:53:09] RECOVERY - SSH on urldownloader1005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:54:31] (03PS1) 10BCornwall: cp3081: Set single nvme drive and single_backend [puppet] - 10https://gerrit.wikimedia.org/r/1306743 (https://phabricator.wikimedia.org/T288106) [17:54:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306739 (https://phabricator.wikimedia.org/T423272) (owner: 10ArielGlenn) [17:55:40] (03CR) 10Ssingh: [C:03+1] cp3081: Set single nvme drive and single_backend [puppet] - 10https://gerrit.wikimedia.org/r/1306743 (https://phabricator.wikimedia.org/T288106) (owner: 10BCornwall) [17:56:42] RESOLVED: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:56:55] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8833/co" [puppet] - 10https://gerrit.wikimedia.org/r/1306743 (https://phabricator.wikimedia.org/T288106) (owner: 10BCornwall) [17:57:14] (03CR) 10BCornwall: [V:03+1 C:03+2] cp3081: Set single nvme drive and single_backend [puppet] - 10https://gerrit.wikimedia.org/r/1306743 (https://phabricator.wikimedia.org/T288106) (owner: 10BCornwall) [17:58:10] PROBLEM - SSH on urldownloader1005 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:58:42] FIRING: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:52] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS trixie [18:00:04] jouncebot: before you even ask, the answer is: No [18:00:05] andre and brennen: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T1800). [18:00:50] (03PS1) 10Ahmon Dancy: profile::puppet::agent: write pinned CA cert with a trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/1306745 (https://phabricator.wikimedia.org/T429413) [18:02:53] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read views for 50% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306746 (https://phabricator.wikimedia.org/T430194) [18:02:59] (03PS2) 10Ahmon Dancy: profile::puppet::agent: write pinned CA cert with a trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/1306745 (https://phabricator.wikimedia.org/T429413) [18:03:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306746 (https://phabricator.wikimedia.org/T430194) (owner: 10C. 
Scott Ananian) [18:03:54] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1263: Migration of db1263.eqiad.wmnet completed [18:03:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [18:05:24] (03CR) 10Subramanya Sastry: [C:03+1] Turn on Parsoid Read views for 50% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306746 (https://phabricator.wikimedia.org/T430194) (owner: 10C. Scott Ananian) [18:06:13] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2079.codfw.wmnet with reason: host reimage [18:06:26] o/ nothing for this window. [18:07:01] train't [18:07:04] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2110.codfw.wmnet with reason: host reimage [18:08:58] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2079.codfw.wmnet with reason: host reimage [18:10:21] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2065.codfw.wmnet with reason: host reimage [18:10:35] (03PS1) 10Santiago Faci: NOOP change to clean up/remove an absent job after last deployment [puppet] - 10https://gerrit.wikimedia.org/r/1306747 (https://phabricator.wikimedia.org/T422209) [18:11:40] (03CR) 10Ahmon Dancy: "Bugfix" [puppet] - 10https://gerrit.wikimedia.org/r/1306745 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [18:12:00] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2110.codfw.wmnet with reason: host reimage [18:17:20] 10ops-codfw, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 06Discovery-Search (2026.06.01 - 2026.07.03): cirrussearch2089.codfw.wmnet fails reimage cookbook (DRAC error)? 
- https://phabricator.wikimedia.org/T430726 (10bking) 03NEW [18:18:06] 10ops-codfw, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 06Discovery-Search (2026.06.01 - 2026.07.03): cirrussearch2089.codfw.wmnet fails reimage cookbook (DRAC error)? - https://phabricator.wikimedia.org/T430726#12072856 (10bking) [18:18:36] PROBLEM - Elasticsearch HTTPS for production-search-codfw on cirrussearch2065 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [18:19:39] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2065.codfw.wmnet with reason: host reimage [18:23:24] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3081.esams.wmnet with reason: host reimage [18:23:50] (03PS2) 10BCornwall: varnish: Update tests container image to Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1306723 (owner: 10Slyngshede) [18:25:13] (03PS3) 10BCornwall: varnish: Update tests container image to Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1306723 (https://phabricator.wikimedia.org/T401832) (owner: 10Slyngshede) [18:27:14] (03CR) 10CDobbins: [C:03+1] varnish: Update tests container image to Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1306723 (https://phabricator.wikimedia.org/T401832) (owner: 10Slyngshede) [18:28:39] RECOVERY - Elasticsearch HTTPS for production-search-codfw on cirrussearch2065 is OK: SSL OK - Certificate cirrussearch2065.codfw.wmnet valid until 2026-07-28 18:23:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [18:29:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3081.esams.wmnet with reason: host reimage [18:29:45] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2079.codfw.wmnet with OS trixie [18:30:53] (03CR) 10BCornwall: [V:03+2 C:03+2] "Tests pass for both text and upload using 
podman" [puppet] - 10https://gerrit.wikimedia.org/r/1306723 (https://phabricator.wikimedia.org/T401832) (owner: 10Slyngshede) [18:32:01] RECOVERY - SSH on urldownloader1005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:33:42] RESOLVED: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:35:03] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2110.codfw.wmnet with OS trixie [18:37:11] PROBLEM - SSH on urldownloader1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:37:42] FIRING: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:40:30] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2065.codfw.wmnet with OS trixie [18:40:49] (03CR) 10RLazarus: [C:03+2] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1306747 (https://phabricator.wikimedia.org/T422209) (owner: 10Santiago Faci) [18:54:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3081.esams.wmnet with OS trixie [18:55:30] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3081.* [18:58:28] 14SRE-Sprint-Week-Sustainability-March2023, 06Traffic, 13Patch-For-Review, 07Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106#12072995 (10BCornwall) As of this comment cp3081 is now running on single-backend and with one drive [19:02:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:11:07] RECOVERY - SSH on urldownloader1005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:12:42] RESOLVED: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:13:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:13:24] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2082.codfw.wmnet with OS trixie [19:15:26] (03PS1) 10Btullis: opensearch-cluster: allow sizing the bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306751 
(https://phabricator.wikimedia.org/T402408) [19:15:53] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2066.codfw.wmnet with OS trixie [19:17:17] PROBLEM - SSH on urldownloader1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:17:49] PROBLEM - Host cirrussearch2089 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306414 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [19:18:57] FIRING: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:57] (03CR) 10Btullis: [C:03+2] opensearch-cluster: allow sizing the bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306751 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [19:21:33] (03CR) 10Bking: "Don't forget to update charts/opensearch-cluster/CHANGELOG.md when you make changes to the chart like this." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1306751 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [19:24:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:24:56] (03Merged) 10jenkins-bot: opensearch-cluster: allow sizing the bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306751 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [19:30:47] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2082.codfw.wmnet with reason: host reimage [19:31:01] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2101.codfw.wmnet with OS trixie [19:31:56] (03PS1) 10Btullis: datahub-next: size the OpenSearch bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306755 (https://phabricator.wikimedia.org/T402408) [19:32:39] (03PS2) 10Btullis: datahub-next: size the OpenSearch bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306755 (https://phabricator.wikimedia.org/T402408) [19:33:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:34:21] (03PS3) 10Btullis: datahub-next: size the OpenSearch bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306755 (https://phabricator.wikimedia.org/T402408) [19:34:25] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2066.codfw.wmnet with reason: host reimage [19:34:56] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2:00:00 on cirrussearch2082.codfw.wmnet with reason: host reimage [19:36:12] (03CR) 10Bking: [C:03+1] datahub-next: size the OpenSearch bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306755 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [19:38:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:38:31] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:39:08] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2066.codfw.wmnet with reason: host reimage [19:43:19] RECOVERY - Host cirrussearch2089 is UP: PING OK - Packet loss = 0%, RTA = 32.92 ms [19:43:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:43:31] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:44:32] (03PS1) 10Clare Ming: Add Test Kitchen config for draft validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306757 (https://phabricator.wikimedia.org/T429420) [19:46:39] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: cirrussearch1086-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:47:08] (03PS2) 10Clare Ming: Add Test Kitchen config for draft validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306757 (https://phabricator.wikimedia.org/T429420) [19:48:51] (03CR) 10Gergő Tisza: "recheck" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306737 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [19:50:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2089.codfw.wmnet with OS trixie [19:51:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 06Discovery-Search (2026.06.01 - 2026.07.03): cirrussearch2089.codfw.wmnet fails reimage cookbook (DRAC error)? - https://phabricator.wikimedia.org/T430726#12073326 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was... 
[19:51:16] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2101.codfw.wmnet with reason: host reimage [19:52:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:55:30] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2101.codfw.wmnet with reason: host reimage [19:56:13] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.4.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306759 (https://phabricator.wikimedia.org/T406576) [19:56:48] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2082.codfw.wmnet with OS trixie [19:58:20] RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T2000). 
[20:00:05] chlod, tgr, apergos, cscott, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] o/ [20:00:12] o/ [20:00:15] \o [20:00:39] my config patch is pretty trivial and could probably ride with other configs [20:00:48] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2066.codfw.wmnet with OS trixie [20:03:14] (03CR) 10Btullis: [C:03+2] datahub-next: size the OpenSearch bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306755 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:04:11] who's running the window, anybody? [20:04:12] (03PS1) 10Cwhite: beta-logs: set logstash heap memory and batch size [puppet] - 10https://gerrit.wikimedia.org/r/1306761 [20:04:30] (03PS2) 10Cwhite: beta-logs: set logstash heap memory and batch size [puppet] - 10https://gerrit.wikimedia.org/r/1306761 [20:05:25] i don't have deployer access :( [20:05:27] (03Merged) 10jenkins-bot: datahub-next: size the OpenSearch bootstrap pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306755 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:05:46] (03CR) 10Cwhite: [C:03+2] beta-logs: set logstash heap memory and batch size [puppet] - 10https://gerrit.wikimedia.org/r/1306761 (owner: 10Cwhite) [20:06:19] i suppose i can deploy, sec [20:06:41] I could also deploy for chlod if needed [20:07:05] * TheresNoTime is also now around [20:07:05] \o/ [20:07:22] *almost like someone said something* [20:07:27] lol [20:07:39] (because volunteers get priority IMHO so they don’t have to wait around forever; that said I’m not planning to spend the whole window deploying for WMF staff while I’m not supposed to be working, just FTR :D) [20:07:44] ok so who’s doing it? 
:D [20:07:46] (03PS1) 10Btullis: opensearch-cluster: Retrospectively update the CHANGELOG [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306762 (https://phabricator.wikimedia.org/T402408) [20:07:57] any other configs i should chip same time? apergos cscott [20:08:01] s/chip/ship/ [20:08:26] mine is just extensiondistributor which is on mediawiki.org only so that's a good candidate [20:08:44] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [20:08:48] Lucas_WMDE: i should be able to do these and tgr can handle the extension deploy [20:08:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [20:08:56] ebernhardson: sounds good, thank you! [20:08:59] * Lucas_WMDE vanishes into the night [20:09:09] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: cirrussearch1086-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [20:09:22] I can self-deploy too if desired, but if it's being combined up with someone else's, that's fine too [20:09:29] haven't heard from cscott yet, going to start these [20:09:44] plus yours too, right?
[20:09:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306414 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:09:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306210 (https://phabricator.wikimedia.org/T424519) (owner: 10Chlod Alejandro) [20:09:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306739 (https://phabricator.wikimedia.org/T423272) (owner: 10ArielGlenn) [20:09:48] ya [20:09:50] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:09:53] great [20:10:22] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386#12073435 (10cmooney) 05Open→03Resolved [20:10:47] (03Merged) 10jenkins-bot: Revert^3 "cirrus: AB test query suggester variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306414 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:10:51] (03Merged) 10jenkins-bot: Revert "nlwiki: change to Wikipedia 25 logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306210 (https://phabricator.wikimedia.org/T424519) (owner: 10Chlod Alejandro) [20:10:55] (03Merged) 10jenkins-bot: ExtensionDistributor: mark 1.46 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306739 (https://phabricator.wikimedia.org/T423272) (owner: 10ArielGlenn) 
[20:11:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2089.codfw.wmnet with reason: host reimage [20:11:24] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1306414|Revert^3 "cirrus: AB test query suggester variants" (T407432)]], [[gerrit:1306210|Revert "nlwiki: change to Wikipedia 25 logo" (T424519)]], [[gerrit:1306739|ExtensionDistributor: mark 1.46 as stable (T423272)]] [20:11:34] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:11:35] T424519: Per community Rfc, for the month of June please change to the birthday logo for nl.wikipedia.org - https://phabricator.wikimedia.org/T424519 [20:11:35] T423272: Mark REL1_46 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T423272 [20:13:06] oh you are spiderpigging it, heh [20:13:39] it's the easy way :) [20:14:24] o/ [20:14:25] I'm still enjoying how much easier scap backport is than the previous very longwinded procedure! 
[20:14:44] sorry, i'm a little late, but i can spiderpig mine when everyone else is done [20:15:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2089.codfw.wmnet with reason: host reimage [20:15:28] i'm in UTC-10 at the moment, so it's still early in my day :) [20:15:34] ahh, that makes sense :) [20:16:55] (03CR) 10Bking: [C:03+1] opensearch-cluster: Retrospectively update the CHANGELOG [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306762 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:17:32] !log ebernhardson@deploy1003 chlod, ebernhardson, ariel: Backport for [[gerrit:1306414|Revert^3 "cirrus: AB test query suggester variants" (T407432)]], [[gerrit:1306210|Revert "nlwiki: change to Wikipedia 25 logo" (T424519)]], [[gerrit:1306739|ExtensionDistributor: mark 1.46 as stable (T423272)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:17:39] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:17:40] T424519: Per community Rfc, for the month of June please change to the birthday logo for nl.wikipedia.org - https://phabricator.wikimedia.org/T424519 [20:17:40] T423272: Mark REL1_46 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T423272 [20:17:41] checking [20:18:10] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2101.codfw.wmnet with OS trixie [20:18:15] fine for me (check is "didn't break the wikis") [20:18:30] all good for me :) [20:18:34] awesome, continuing [20:18:37] !log ebernhardson@deploy1003 chlod, ebernhardson, ariel: Continuing with deployment [20:19:12] RECOVERY - SSH on urldownloader1005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:23:12] (03CR) 10Btullis: [C:03+2] opensearch-cluster: 
Retrospectively update the CHANGELOG [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306762 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:23:57] RESOLVED: [2x] ProbeDown: Service urldownloader1005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:07] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306414|Revert^3 "cirrus: AB test query suggester variants" (T407432)]], [[gerrit:1306210|Revert "nlwiki: change to Wikipedia 25 logo" (T424519)]], [[gerrit:1306739|ExtensionDistributor: mark 1.46 as stable (T423272)]] (duration: 13m 43s) [20:25:15] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:25:16] T424519: Per community Rfc, for the month of June please change to the birthday logo for nl.wikipedia.org - https://phabricator.wikimedia.org/T424519 [20:25:16] T423272: Mark REL1_46 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T423272 [20:26:12] cscott: should be available now [20:26:34] thanks for the deploy ebernhardson! :D [20:26:42] yw! [20:26:43] (03Merged) 10jenkins-bot: opensearch-cluster: Retrospectively update the CHANGELOG [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306762 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:27:10] works great in prod, have a nice day. I'll still be reachable in here as usual etc [20:27:14] 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#12073531 (10cmooney) We now have the //CoreRouterInterfaceDropPercent// alerts, which to an extent covers this scenario. It's not exactly the same, but we are in... 
[20:27:15] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2101.codfw.wmnet with OS trixie [20:27:17] (03CR) 10Santiago Faci: [C:03+2] Add Test Kitchen config for draft validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306757 (https://phabricator.wikimedia.org/T429420) (owner: 10Clare Ming) [20:27:33] ebernhardson: ok, thanks! [20:27:54] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.4.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306759 (https://phabricator.wikimedia.org/T406576) (owner: 10Clare Ming) [20:29:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306746 (https://phabricator.wikimedia.org/T430194) (owner: 10C. Scott Ananian) [20:29:39] (03Merged) 10jenkins-bot: Add Test Kitchen config for draft validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306757 (https://phabricator.wikimedia.org/T429420) (owner: 10Clare Ming) [20:30:09] (03Merged) 10jenkins-bot: Turn on Parsoid Read views for 50% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306746 (https://phabricator.wikimedia.org/T430194) (owner: 10C. 
Scott Ananian) [20:30:15] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.4.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306759 (https://phabricator.wikimedia.org/T406576) (owner: 10Clare Ming) [20:30:35] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1306746|Turn on Parsoid Read views for 50% of English Wikipedia desktop traffic (T430194)]] [20:30:41] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194 [20:31:53] (03PS1) 10Btullis: datahub-next: size the OpenSearch securityconfig pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306767 (https://phabricator.wikimedia.org/T402408) [20:32:03] (03CR) 10CI reject: [V:04-1] datahub-next: size the OpenSearch securityconfig pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306767 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:32:08] (03PS2) 10Btullis: datahub-next: size the OpenSearch securityconfig pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306767 (https://phabricator.wikimedia.org/T402408) [20:32:44] !log cscott@deploy1003 cscott: Backport for [[gerrit:1306746|Turn on Parsoid Read views for 50% of English Wikipedia desktop traffic (T430194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:33:46] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2113.codfw.wmnet with OS trixie [20:34:55] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [20:35:01] (03CR) 10Btullis: [C:03+2] datahub-next: size the OpenSearch securityconfig pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306767 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:35:17] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [20:36:00] (03CR) 10JHathaway: Add sre.hosts.bmc-user-mgmt.py (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [20:36:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2089.codfw.wmnet with OS trixie [20:36:40] !log cscott@deploy1003 cscott: Continuing with deployment [20:36:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 06Discovery-Search (2026.06.01 - 2026.07.03): cirrussearch2089.codfw.wmnet fails reimage cookbook (DRAC error)? - https://phabricator.wikimedia.org/T430726#12073601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage sta... [20:37:09] (03Merged) 10jenkins-bot: datahub-next: size the OpenSearch securityconfig pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306767 (https://phabricator.wikimedia.org/T402408) (owner: 10Btullis) [20:38:58] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [20:39:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [20:39:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 06Discovery-Search (2026.06.01 - 2026.07.03): cirrussearch2089.codfw.wmnet fails reimage cookbook (DRAC error)? 
- https://phabricator.wikimedia.org/T430726#12073618 (10Jhancock.wm) @bking the main board was replaced in T399943... [20:41:00] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306746|Turn on Parsoid Read views for 50% of English Wikipedia desktop traffic (T430194)]] (duration: 10m 24s) [20:41:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [20:41:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [20:41:05] T430194: Parsoid Read Views deploy to English Wikipedia (enwiki) June 25-June 30 - https://phabricator.wikimedia.org/T430194 [20:41:16] (03CR) 10Jdlrobson: [C:04-1] "Let's deploy Monday 6th July. -1ing until then. I can backport this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305921 (https://phabricator.wikimedia.org/T417638) (owner: 10Bernard Wang) [20:46:50] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2101.codfw.wmnet with reason: host reimage [20:50:22] (03PS4) 10Gergő Tisza: Remove security-related log hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 (https://phabricator.wikimedia.org/T430564) [20:50:52] (03CR) 10Gergő Tisza: "(removing Depends-On: I99731df619b67dc7bd98643197dd8aa9d5d600f8 as Spiderpig is being annoying about it)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [20:52:03] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2113.codfw.wmnet with reason: host reimage [20:52:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306735 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [20:52:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap 
backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306737 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [20:52:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [20:53:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [20:53:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [20:53:55] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2101.codfw.wmnet with reason: host reimage [20:54:25] (03Merged) 10jenkins-bot: Remove security-related log hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306141 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [20:54:42] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/0 (Transport: Hurricane Electric (dc4841.mrs1) {#changeme_drmrs_he_cct}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:55:03] (03Merged) 10jenkins-bot: SecurityLogs: Create by moving code from mediawiki-config [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306735 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [20:55:05] (03Merged) 10jenkins-bot: SecurityLogs: Add tests [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1306737 (https://phabricator.wikimedia.org/T430564) (owner: 10Gergő Tisza) [20:55:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s_services/services/datahub-next: apply [20:55:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [20:55:33] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1306735|SecurityLogs: Create by moving code from mediawiki-config (T430564)]], [[gerrit:1306737|SecurityLogs: Add tests (T430564)]], [[gerrit:1306141|Remove security-related log hooks (T430564)]] [20:55:39] T430564: Move security logging from mediawiki-config to WikimediaCustomization - https://phabricator.wikimedia.org/T430564 [20:57:11] (03PS5) 10JHathaway: Add find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) [20:57:39] !log tgr@deploy1003 tgr: Backport for [[gerrit:1306735|SecurityLogs: Create by moving code from mediawiki-config (T430564)]], [[gerrit:1306737|SecurityLogs: Add tests (T430564)]], [[gerrit:1306141|Remove security-related log hooks (T430564)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:58:03] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2113.codfw.wmnet with reason: host reimage [20:58:06] (03CR) 10JHathaway: Add find_accounts (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) (owner: 10JHathaway) [20:58:41] (03PS6) 10JHathaway: redfish: add find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260630T2100) [21:04:49] (03PS1) 10Andrew Bogott: magnum policy.yaml: replace rule:admin_or_user with rule:admin_or_member [puppet] - 10https://gerrit.wikimedia.org/r/1306775 (https://phabricator.wikimedia.org/T430680) [21:06:53] Removing the NodeJS iPoid service [21:08:59] !log tgr@deploy1003 tgr: Continuing with deployment [21:10:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 06Discovery-Search (2026.06.01 - 2026.07.03): cirrussearch2089.codfw.wmnet fails reimage cookbook (DRAC error)? - https://phabricator.wikimedia.org/T430726#12073749 (10bking) 05Open→03Resolved a:03bking Excellent! I ju... 
[21:11:12] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [21:13:18] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1306735|SecurityLogs: Create by moving code from mediawiki-config (T430564)]], [[gerrit:1306737|SecurityLogs: Add tests (T430564)]], [[gerrit:1306141|Remove security-related log hooks (T430564)]] (duration: 17m 45s) [21:13:22] T430564: Move security logging from mediawiki-config to WikimediaCustomization - https://phabricator.wikimedia.org/T430564 [21:14:23] !log Destroyed NodeJS iPoid service deployments for T416623 [21:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:27] T416623: Decommission NodeJS IPoid service - https://phabricator.wikimedia.org/T416623 [21:14:37] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2101.codfw.wmnet with OS trixie [21:15:48] !log UTC late deploys done [21:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:13] cmooney@cumin1003 netbox (PID 185728) is awaiting input [21:17:41] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link IP dns for trasnport circuits to drmrs - cmooney@cumin1003" [21:18:48] (03PS1) 10Dreamy Jazz: ipoid: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/1306779 (https://phabricator.wikimedia.org/T416623) [21:19:19] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2113.codfw.wmnet with OS trixie [21:19:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link IP dns for trasnport circuits to drmrs - cmooney@cumin1003" [21:19:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:19:41] (03PS1) 10Cathal Mooney: Reverse PTR INCLUDE statements for 
new transport linknets in drmrs [dns] - 10https://gerrit.wikimedia.org/r/1306780 (https://phabricator.wikimedia.org/T412537) [21:20:21] !log bking@cumin2003 START - Cookbook sre.hosts.reimage for host cirrussearch2071.codfw.wmnet with OS trixie [21:20:39] (03CR) 10CI reject: [V:04-1] Reverse PTR INCLUDE statements for new transport linknets in drmrs [dns] - 10https://gerrit.wikimedia.org/r/1306780 (https://phabricator.wikimedia.org/T412537) (owner: 10Cathal Mooney) [21:23:22] (03PS2) 10Cathal Mooney: Reverse PTR INCLUDE statements for new transport linknets in drmrs [dns] - 10https://gerrit.wikimedia.org/r/1306780 (https://phabricator.wikimedia.org/T412537) [21:23:55] (03PS1) 10Dreamy Jazz: deployment_server: remove ipoid users [puppet] - 10https://gerrit.wikimedia.org/r/1306782 [21:24:17] (03PS2) 10Dreamy Jazz: deployment_server: absent ipoid kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/1306779 (https://phabricator.wikimedia.org/T416623) [21:24:22] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306779 (https://phabricator.wikimedia.org/T416623) (owner: 10Dreamy Jazz) [21:24:28] (03PS2) 10Dreamy Jazz: deployment_server: remove ipoid users [puppet] - 10https://gerrit.wikimedia.org/r/1306782 [21:24:54] (03PS3) 10Dreamy Jazz: deployment_server: remove ipoid users [puppet] - 10https://gerrit.wikimedia.org/r/1306782 (https://phabricator.wikimedia.org/T416623) [21:25:13] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306782 (https://phabricator.wikimedia.org/T416623) (owner: 10Dreamy Jazz) [21:25:44] (03CR) 10Cathal Mooney: [C:03+2] Reverse PTR INCLUDE statements for new transport linknets in drmrs [dns] - 10https://gerrit.wikimedia.org/r/1306780 (https://phabricator.wikimedia.org/T412537) (owner: 10Cathal Mooney) [21:25:50] !log cmooney@dns3003 START - running authdns-update [21:27:57] !log cmooney@dns3003 END - running authdns-update [21:35:16] !log arlolra@deploy1003 
helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [21:35:46] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [21:35:47] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [21:36:18] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [21:38:44] !log bking@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2071.codfw.wmnet with reason: host reimage [21:39:42] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/1/2 (Transport: Arelion (IC-398707) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:44:45] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2071.codfw.wmnet with reason: host reimage [21:47:57] (03PS1) 10Dreamy Jazz: Remove ipoid chart and service definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306784 (https://phabricator.wikimedia.org/T416623) [21:49:16] (03CR) 10Dreamy Jazz: [C:03+2] .fixtures: remove erroneously committed file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295949 (owner: 10Kamila Součková) [21:49:45] (03CR) 10Dreamy Jazz: [C:03+2] "(Looks like the merge got stuck)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295949 (owner: 10Kamila Součková) [21:50:22] (03PS2) 10Dreamy Jazz: Remove ipoid chart and service definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306784 (https://phabricator.wikimedia.org/T416623) [21:50:52] (03Merged) 10jenkins-bot: .fixtures: remove erroneously committed file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295949 (owner: 10Kamila Součková) [21:51:04] (03Abandoned) 10Dreamy Jazz: ipoid: upgrade to new 
modules versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/998987 (owner: 10Giuseppe Lavagetto) [21:51:31] (03PS3) 10Dreamy Jazz: Remove ipoid chart and service definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1306784 (https://phabricator.wikimedia.org/T416623) [22:04:25] (03PS1) 10Dzahn: cloud: set jenkins_agent_user for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/1306787 (https://phabricator.wikimedia.org/T429636) [22:04:59] (03PS4) 10JHathaway: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1305985 (https://phabricator.wikimedia.org/T372666) [22:06:16] !log bking@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2071.codfw.wmnet with OS trixie [22:27:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:44:42] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:et-0/0/0 (Transport: Hurricane Electric (dc4841.sin1) {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:53:54] !log Deploying Refinery at 4e7a2b32 for changes: pageview allowlist 1305158 (+min.wikiquote) 1305162 (+bol.wikipedia), 1305156 (+isv.wikipedia); 1305980 (pv allowlist -api.wikimedia, sqoop +isvwiki); sqoop 1295064 (+globalimagelinks) 1295069 (+filerevision) [22:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:30] !log dr0ptp4kt@deploy1003 Started deploy [analytics/refinery@4e7a2b3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4e7a2b32] [22:56:29] !log 
dr0ptp4kt@deploy1003 Finished deploy [analytics/refinery@4e7a2b3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4e7a2b32] (duration: 01m 59s) [22:57:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:00:17] !log dr0ptp4kt@deploy1003 Started deploy [analytics/refinery@4e7a2b3]: Regular analytics weekly train [analytics/refinery@4e7a2b32] [23:02:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:02:10] ^ httpbb_kubernetes_mw-api-int_hourly was a transient 503 on one url, I reran it and it's happy [23:02:20] (would have cleared on its own after the next run in an hour anyway) [23:04:28] !log dr0ptp4kt@deploy1003 Finished deploy [analytics/refinery@4e7a2b3]: Regular analytics weekly train [analytics/refinery@4e7a2b32] (duration: 04m 11s) [23:04:48] !log dr0ptp4kt@deploy1003 Started deploy [analytics/refinery@4e7a2b3] (thin): Regular analytics weekly train THIN [analytics/refinery@4e7a2b32] [23:06:49] !log dr0ptp4kt@deploy1003 Finished deploy [analytics/refinery@4e7a2b3] (thin): Regular analytics weekly train THIN [analytics/refinery@4e7a2b32] (duration: 02m 00s) [23:07:33] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2003 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:12:41] !log T429844 [opensearch] cleared stale/redundant cluster.routing.allocation.* transient settings from all production search clusters; also cleared 
redundant action.auto_create_index and cluster.routing.use_adaptive_replica_selection transients from codfw chi/psi, and stale transient DEBUG logger overrides from codfw chi [23:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:47] T429844: Migrate production OpenSearch clusters from 1.x-2.x - CODFW - https://phabricator.wikimedia.org/T429844 [23:15:08] (03PS1) 10Cwhite: profile: add initial opensearch security plugin config [puppet] - 10https://gerrit.wikimedia.org/r/1306792 (https://phabricator.wikimedia.org/T350516) [23:15:43] PROBLEM - SSH on urldownloader2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:16:42] FIRING: [2x] ProbeDown: Service urldownloader2005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:30] (03CR) 10Dzahn: [C:03+2] cloud: set jenkins_agent_user for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/1306787 (https://phabricator.wikimedia.org/T429636) (owner: 10Dzahn) [23:30:14] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2103.codfw.wmnet with OS trixie [23:31:27] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2102.codfw.wmnet with OS trixie [23:41:38] RECOVERY - SSH on urldownloader2005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:42:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306795 [23:42:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306795 (owner: 
10TrainBranchBot) [23:44:42] PROBLEM - SSH on urldownloader2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:49:38] RECOVERY - SSH on urldownloader2005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:50:08] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2103.codfw.wmnet with reason: host reimage [23:51:38] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2102.codfw.wmnet with reason: host reimage [23:52:42] PROBLEM - SSH on urldownloader2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:54:06] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2103.codfw.wmnet with reason: host reimage [23:57:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2102.codfw.wmnet with reason: host reimage